Abstract

Bayesian Belief Networks are a powerful tool for combining different knowledge sources with various degrees of
uncertainty in a mathematically sound and computationally efficient way. Surprisingly, they have not yet found their
way into the speech processing field, despite the fact that in this science multiple unreliable information sources
exist. The present paper shows how the theory can be utilized for language modeling. After providing an
introduction to the theory of Bayesian Networks, we develop several extensions to the classic theory by describing
mechanisms for dealing with statistical dependence among daughter nodes (usually assumed to be conditionally
independent) and by providing a learning algorithm based on the EM-algorithm with which the probabilities of link
matrices can be learned from example data. Using these extensions a language model for speech recognition based
on a context-free framework is constructed. In this model, sentences are not parsed in their entirety, as is usual with
grammatical description, but only "locally" on suitably located segments. The model was evaluated over a text data
base. In terms of test set entropy the model performed at least as well as the bi/tri-gram models, while showing a
good ability to generalize from training to test data.

Zusammenfassung 

Bayesian belief networks are useful tools for combining different knowledge sources with differing degrees of
uncertainty in a mathematically rigorous and computationally inexpensive way. Surprisingly, they have so far hardly
been applied in speech recognition, although it is precisely in this field that knowledge sources of varying
reliability must be considered and exploited in order to achieve useful recognition rates. After an introduction to
the theory of Bayesian networks, several extensions needed for language modeling are described. In particular, the
statistical independence of the direct consequences of a cause is dropped, and a learning procedure based on the EM
algorithm is described, with which the model parameters can be learned even from incomplete training material. With
the help of these extensions it is possible to realize a language model based on stochastic context-free grammar
rules. Such a model is described and evaluated experimentally on a text database. The model showed a language
entropy at least as good as that of the n-gram models and demonstrated a good ability to generalize from the
training material to the test material.



This paper is based on a communication presented at the ESCA Conference EUROSPEECH-93 and has been recommended
by the EUROSPEECH-93 scientific program committee.

Résumé



Bayesian belief networks are an effective tool for combining different information sources with different degrees
of uncertainty. They are mathematically rigorous and can be implemented efficiently. Unfortunately, they are little
used in speech processing. This article shows how they can be applied to this field. After an introduction to
Bayesian networks, we propose some extensions to the classical theory. We describe how to account for statistical
dependence among daughter nodes, which are usually considered independent. We also give an algorithm, inspired by
the EM algorithm, that allows link matrices to be learned from examples. Using these extensions, we construct a
context-free language model for speech recognition in which sentences are analyzed locally, on judiciously chosen
segments, instead of being analyzed in their entirety. This model was evaluated on a text database. Its performance,
measured by test-set entropy, is at least equal to that of bigram or trigram models; we also observed a good
ability to generalize from training data to test data.

Keywords: Bayesian Networks; Grammar inference; Stochastic parsing 



1. Introduction 

Robust automatic speech recognition requires the use, and hence combination, of several uncertain
information sources. First, the acoustic signal of an utterance, being constrained by the physical
movements of the articulators and contaminated with noise, contains a high degree of uncertainty. Even
after processing by acoustic models such as hidden Markov models, the degree of uncertainty in lists of
candidate words or word lattices remains high. Language models are used to further improve the
selection of the word strings. These models are usually themselves expressed probabilistically. Using
semantic and pragmatic information to further assist the speech recognition process has also been
proposed, and again such a scheme would rely on the combination of various uncertain information
sources.

As for language models, the n-gram model has been remarkably successful (Jelinek, 1991). Most
popular are the so-called bigram and trigram models, although through a tree-based clustering method it
has been possible to extend these to 21-grams (Bahl et al., 1989). It is somewhat surprising that without
performing any structural analysis, as advocated by traditional linguistics, such models can perform so
well. Their remarkable success can only be explained by their solid (if simple) mathematical basis.

In this paper we propose a language model which is based on structural analysis. This can become 
useful when structural analysis is required to extract meaning or to interact with semantic or even 
pragmatic knowledge sources. The basic formalism adopted is based on Bayesian belief networks, a 
stochastic inference mechanism mainly developed for probabilistic expert systems. It is anticipated that 
this formalism will make it easier to incorporate further knowledge sources, although this is not 
attempted in this paper. Here we concentrate mainly on the theoretical background necessary in 
adopting Bayesian belief networks for linguistic analysis. We will describe an iterative training method 
that learns both structure and probabilities in an unsupervised fashion. Convergence of the algorithm will 
be shown. 

From a linguistic viewpoint, the algorithm assumes context-freeness of the language and learns
probabilistic production rules that are best supported by the training data. As such, the algorithm falls in
the class of grammatical inference algorithms. Prior work in this area includes many symbolic (e.g.
Anderson, 1981; Berwick, 1980; Wolff, 1980, 1982), connectionist (McClelland and Rumelhart, 1987)
and probabilistic (Baker, 1979) approaches. A good survey of earlier work was written by Fu and Booth
(1986). Some limits on the learnability of unconstrained context-free languages in the non-probabilistic
case are also known (Gold, 1967).

Fig. 1. A typical Bayesian network for use in a medical expert system (after Cowell et al., 1993).



2. Bayesian Belief Networks 

Bayesian Networks (influence diagrams) are a tool for calculating posterior probability distributions
over sets of random variables. For a good treatment see (Pearl, 1987). They have been studied in the
artificial intelligence literature with the aim of producing expert systems that are capable of dealing with
uncertain information. For example, in a medical application a Bayesian Network may relate observable
symptoms to unobservable causes (i.e. physiological conditions). A typical such network is shown in Fig.
1.

In such a network each node represents a random variable. Pearl notes that if the underlying graph is 
a tree then there exists a simple algorithm for calculating the posterior probability distribution of each 
variable in the network. The aim of this paper is to show that this theory can easily be applied to 
statistical modeling in speech where we similarly have observable "symptoms" such as the waveform or 
surface word string and unobservable (or hidden) variables such as HMM states or the linguistic units 
(like "noun phrase", etc). 

A typical Bayesian network in the form of a tree is shown in Fig. 2. Each node corresponds to a
random variable. For simplicity we will assume that they are all discrete. Some of the variables may be
observable, others may not be. For a given observation, the states of the observable nodes are called the
evidence and are denoted e. The arrows in the graph indicate the assumed causal influences. Thus an
arrow pointing from node A to node B means that A has a direct causal influence on B. This
relationship is quantified in a matrix Pr(B | A) which gives for each possible value of A the probability
distribution of B. Our knowledge base then consists of the directed graph, the link matrices for each link
and the prior probability distribution Pr(R) of the root node. Given this knowledge base and an
observation (instantiation of the observable nodes) we would then like to calculate the posterior
probability distribution of all the unobservable nodes A, i.e. find the distributions Pr(A | e).
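As a minimal sketch of such a knowledge base, consider a hypothetical two-node network A -> B in which B is observed. The numbers, variable names and the helper `posterior_of_parent` are illustrative assumptions, not taken from the paper; the posterior is obtained here directly from Bayes' rule rather than by the propagation scheme described later.

```python
# Hypothetical two-node network A -> B; all numbers are made up for illustration.
prior_a = [0.7, 0.3]                  # Pr(A), prior of the root node
link_b_given_a = [[0.9, 0.1],         # Pr(B | A = a_0)
                  [0.2, 0.8]]         # Pr(B | A = a_1)


def posterior_of_parent(prior, link, observed_child_state):
    """Return Pr(A | B = observed_child_state) by Bayes' rule."""
    joint = [prior[i] * link[i][observed_child_state] for i in range(len(prior))]
    total = sum(joint)                # Pr(B = observed_child_state)
    return [p / total for p in joint]


# Observing B in its second state shifts belief towards A = a_1.
posterior = posterior_of_parent(prior_a, link_b_given_a, observed_child_state=1)
print(posterior)
```

Each row of the link matrix is a probability distribution over the states of B, which is the storage convention assumed in this sketch.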

Pearl describes a simple and efficient algorithm for calculating these probability distributions. It is 
based on the propagation of two vectors known as the diagnostic and causal support vectors through the 
network.

Fig. 2. A typical causal tree diagram.



In the remainder of this introduction we will define the notation and quote the propagation equations. 
In Section 4 we will generalize them in a form suitable for the experimental work and will supply the 
necessary derivations. We will also provide a training method for learning the link matrices automatically 
even between unobserved nodes. In Section 5 the convergence of the training method will be proved. We 
will then describe how the presented ideas could be used for learning a language model based on a 
context-free paradigm (Section 6) and report experimental results (Section 7). 



Conditional probabilities: The derivations in this paper make use of conditional probabilities
of the form Pr(A | B). We will use capital letters A, B, C to indicate nodes in the networks.
A node A can be thought of as a discrete random variable taking values $a_1, a_2, \ldots$ If one
or more of these node variables enters the argument of the probability function Pr the
resulting expression is to be interpreted as a tensor. For example the expression
$\Pr(e_U^- \mid U)$ is a vector. It has as many components as U can take values. The i'th
component is $\Pr(e_U^- \mid U = u_i)$. Similarly $\Pr(e_U^-, U \mid V, X, e_X^+)$ is a tensor of
rank 3 whose ijk'th component is $\Pr(e_U^-, U = u_i \mid V = v_j, X = x_k, e_X^+)$.

Important vectors and their definitions:

    $\lambda(U) = \Pr(e_U^- \mid U)$      (diagnostic support vector)
    $\pi(U) = \Pr(U \mid e_U^+)$          (causal support vector)
    $\mathrm{BEL}(U) = \Pr(U \mid e)$     (belief vector)

Vector products:

    $ab$                      componentwise vector product: $(ab)_i = a_i b_i$
    $a \cdot b$               familiar dot product: $a \cdot b = \sum_i a_i b_i$
    $\lambda(U)\Pr(U \mid V)$ vector-matrix product. Vector-matrix products of this form are
                              performed in the only sensible way, i.e. identifying states of the
                              same variable. So if i indexes states of the variable U and j indexes
                              states of the variable V then
                              $(\lambda(U)\Pr(U \mid V))_j = \sum_i \lambda(U)_i \Pr(U = i \mid V = j)$.
                              Thus in general we will not make transpositions of vectors or
                              matrices explicit.

Normalizing constants: When the letter α appears in front of a vector it represents a
real number normalizing that vector. Occasionally α occurs more than once in an
equation. In this case they usually represent different normalization factors.

Fig. 3. Notation used in this paper. 
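The three products of Fig. 3 can be written out in a few lines. The following is a sketch only: the function names and the convention of storing a link matrix as `link[j][i] = Pr(U = i | V = j)` (one row per parent state) are assumptions made for illustration.

```python
def componentwise(a, b):
    """Componentwise vector product: (ab)_i = a_i * b_i."""
    return [x * y for x, y in zip(a, b)]


def dot(a, b):
    """Familiar dot product: sum_i a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))


def vector_matrix(lam_u, link_u_given_v):
    """Vector-matrix product lambda(U) Pr(U | V), indexed by the states j of V.

    Assumes link_u_given_v[j][i] = Pr(U = i | V = j).
    """
    return [sum(lam_u[i] * row[i] for i in range(len(lam_u)))
            for row in link_u_given_v]


def normalize(v):
    """Multiply v by the normalizing constant alpha = 1 / sum(v)."""
    total = sum(v)
    return [x / total for x in v]


link = [[0.9, 0.1],
        [0.2, 0.8]]
# U observed in its first state: the product picks out Pr(U = 0 | V = j).
print(vector_matrix([1.0, 0.0], link))
```

Summing over the index shared with the vector is what "identifying states of the same variable" amounts to in this representation, so no explicit transposition is needed.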






H. Lucke/ Speech Communication 16 (1995) 89-118 93 

2.1. Belief calculation by graph propagation equations

We will begin by defining some notation. Random variables will be written by upper case letters.
These usually correspond to the nodes in the graph and will be identified with these. The evidence e
stands for the total observed information (values of the observed random variables). Following Pearl, for
a node X we write $e_X^-$ to indicate the part of the evidence that is connected to one of the descendants of
X (i.e. that can be reached from X by walking only in the direction of the arrows in the graph) and $e_X^+$
for the remaining evidence $e \setminus e_X^-$. Some additional notation is summarized in Fig. 3.

Furthermore we define for each node X of the diagram two vectors: the diagnostic support vector,

$$ \lambda(X) = \Pr(e_X^- \mid X) = \bigl(\Pr(e_X^- \mid X = x_i)\bigr)_i, \tag{1} $$

and the causal support vector,

$$ \pi(X) = \Pr(X \mid e_X^+) = \bigl(\Pr(X = x_i \mid e_X^+)\bigr)_i. \tag{2} $$

The essence of the propagation theory lies in the fact that these quantities can be "propagated"
through the underlying graph and thereby calculated recursively. In order to state these recursions we
define two auxiliary vectors. If V is a parent of nodes $U_1, \ldots, U_k$ (see Fig. 4), we define

$$ \lambda_{U_r}(V) = \Pr(e_{U_r}^- \mid V), \tag{3} $$
$$ \pi_{U_r}(V) = \Pr(V \mid e_{U_r}^+). \tag{4} $$

One can now show (for a derivation see (Pearl, 1987) or Section 4.1 where we derive more general
versions of these equations):

$$ \lambda_{U_r}(V) = \lambda(U_r)\,\Pr(U_r \mid V), \tag{5} $$
$$ \lambda(V) = \prod_{r=1}^{k} \lambda_{U_r}(V), \tag{6} $$
$$ \pi_{U_r}(V) = \alpha\,\pi(V) \prod_{s \neq r} \lambda_{U_s}(V), \tag{7} $$
$$ \pi(U_r) = \Pr(U_r \mid V)\,\pi_{U_r}(V). \tag{8} $$

Fig. 4. Illustration of the propagation mechanism.

Here the products on the right-hand side of Eqs. (5) and (8) are familiar vector-matrix products. The
vector product involved in Eqs. (6) and (7) is the component-wise vector product (i.e. vector x vector =
vector). The coefficient α in Eq. (7) is a scalar chosen such that the vector $\pi_{U_r}(V)$ on the left-hand side is
normalized.

Using Eqs. (6) and (8) it is thus possible to calculate the λ and π vectors of all nodes in the network
from the π vector of the root node (node H in our example) and the λ vectors of the other leaf nodes.
The posterior probability distribution Pr(U | e) of a node U is denoted BEL(U) and can be calculated as

$$ \mathrm{BEL}(U) = \Pr(U \mid e) = \alpha\,\lambda(U)\,\pi(U) = \frac{\lambda(U)\,\pi(U)}{\lambda(U) \cdot \pi(U)}, \tag{9} $$

where again the scalar α is chosen such as to normalize the resulting vector.
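The propagation can be traced on a very small tree. The sketch below assumes a hypothetical root V with two daughters, U1 (observed) and U2 (unobserved); all numbers and function names are illustrative. It uses the standard tree-propagation messages described by Pearl: a λ message from each daughter to the parent, their componentwise product at the parent, a normalized π message to each daughter, and the belief of Eq. (9).

```python
def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def componentwise(vectors, size):
    out = [1.0] * size
    for v in vectors:
        out = [a * b for a, b in zip(out, v)]
    return out


def lambda_message(lam_child, link):
    """Message from daughter U to parent V; link[j][i] = Pr(U = i | V = j)."""
    return [sum(lam_child[i] * row[i] for i in range(len(lam_child))) for row in link]


def pi_of_child(link, pi_message):
    """pi(U)_i = sum_j Pr(U = i | V = j) * pi_U(V)_j."""
    return [sum(link[j][i] * pi_message[j] for j in range(len(link)))
            for i in range(len(link[0]))]


prior_v = [0.6, 0.4]                       # pi(V) for the root node
link_u1 = [[0.9, 0.1], [0.3, 0.7]]         # Pr(U1 | V)
link_u2 = [[0.8, 0.2], [0.1, 0.9]]         # Pr(U2 | V)
lam_u1 = [0.0, 1.0]                        # U1 observed in state 1
lam_u2 = [1.0, 1.0]                        # U2 unobserved: no diagnostic evidence

msg_u1 = lambda_message(lam_u1, link_u1)   # lambda message U1 -> V
msg_u2 = lambda_message(lam_u2, link_u2)   # lambda message U2 -> V
lam_v = componentwise([msg_u1, msg_u2], 2)                  # lambda(V)
bel_v = normalize(componentwise([lam_v, prior_v], 2))       # BEL(V), Eq. (9)

pi_msg_u2 = normalize(componentwise([prior_v, msg_u1], 2))  # pi message V -> U2
pi_u2 = pi_of_child(link_u2, pi_msg_u2)                     # pi(U2)
bel_u2 = normalize(componentwise([lam_u2, pi_u2], 2))       # BEL(U2)
print(bel_v, bel_u2)
```

Because every step uses only quantities available at a node and its neighbours, the same functions could be applied recursively on a deeper tree.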



3. Analysis of Bayesian belief propagation 

In the previous section we stated the propagation equations for Bayesian networks. Since the
calculations are local at each node, it is clear that even complex dependencies can be calculated easily
and accurately by following the structure of the tree. The theory as described above is only applicable to
simple trees, i.e. causal structures in which each event can have at most one cause. A
generalization to poly-trees, where multiple causes exist for an event, is also available but is not discussed
in this paper.

How could an expert system make use of this approach for automated reasoning? According to the 
ideas put forward in the previous section, the knowledge base of such a system would consist of 
probability matrices of the form Pr(Y | X) for events X and Y known or observed to be in a causal
relationship and prior distributions Pr(X). For a given problem it would then be the task of the system to 
propose a plausible network structure connecting the observed events with other unobserved but relevant 
events. After a network is constructed, the belief propagation equations can be used to calculate the 
posterior probability distribution of the unobserved nodes of interest. 

Thus the expert system has to cope with two different problems: a qualitative one, consisting of the
construction of the graphical structure, and a quantitative one, propagating the λ and π vectors through
this network.

Expert systems implementing the quantitative part have already been proposed by Lauritzen and
Spiegelhalter (1988) and others. However, these systems are not able to solve the first problem: the
assignment of the graphical structure. Instead this structure is given in advance and hence forms part of
the knowledge base.

While the studies on such expert systems are interesting and help to demonstrate the propagation
mechanism, they are of limited use in practice. It is infeasible for an expert system to maintain a very
large tree connecting all possible (observed and unobserved) events. Such a network would connect
seemingly unrelated events. Moreover, it would require the propagation of the λ and π vectors from
parts of the network that are only marginally related to the events of interest. It would also be extremely
difficult to establish a large network connecting all nodes of interest and yet have it loop-free as required
by the theory.

Instead one should look for an approach that constructs a network "on the fly", connecting nodes
when they become relevant in the light of observed events. This would entail establishing the relevance
of various observed and unobserved events on the basis of the known Pr(Y | X) matrices and constructing
a network which connects the unobserved events of interest to the relevant observed events, possibly
creating new (previously unknown) unobserved events in the process. We are not aware of any such
algorithm reported in the literature.

Fig. 5. Bayesian tree interpretation of an HMM.



In Section 6 we use the theory of Belief trees to stochastically parse a sentence. In this paradigm the 
words of the sentence form the observed events. Unobserved events are grammatical markers such as 
"noun phrase", "prepositional phrase", etc. It is the object of the parser to connect the possible 
unobserved events to the observed events in the "best possible way". This will create a tree structure (the 
parse tree). The theory of Bayesian networks will give us the quantitative tool for calculating the beliefs 
of all unobserved events and also the likelihood of the chosen tree structure. Thus the problem of finding 
the best parse tree will be the same as finding the best causal explanation of the observed evidence, i.e. it 
directly corresponds to the problem described in the previous paragraph for expert systems. 

3.1. Relationship to hidden Markov Models 

Hidden Markov Models can be viewed as Bayesian trees of a certain fixed topology. Fig. 5 shows such 
an interpretation. 

A number of events (the observation sequence) $o_1, \ldots, o_t, \ldots$ is observed. In addition, unobservable
variables $q_1, \ldots, q_t, \ldots$ (describing the state occupancy at time t) are assumed to exist. The probabilistic
dependencies between the variables are shown by the arrows and quantified by the two matrices A and
B (assuming a discrete HMM). The λ and π vectors take the roles of the familiar backward and forward
probabilities, respectively. The theory of Bayesian Networks is directly applicable because the structure
of the network is fixed as shown.
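To make the correspondence concrete, the following sketch computes the λ and π vectors for the state nodes of a small discrete HMM and combines them into state beliefs. The model parameters, the observation sequence and the function name `state_beliefs` are hypothetical. The sketch assumes the topology in which each state node q_t has the observation o_t and the next state q_{t+1} as daughters, so that λ(q_t) covers o_t, ..., o_T and π(q_t) is conditioned on o_1, ..., o_{t-1}; this grouping differs slightly from the conventional forward and backward variables.

```python
def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def state_beliefs(initial, trans, emit, obs):
    """Return BEL(q_t) = Pr(q_t | o_1..o_T) for every t.

    trans[i][j] = Pr(q_{t+1} = j | q_t = i), emit[i][o] = Pr(o_t = o | q_t = i).
    """
    n, length = len(initial), len(obs)
    # lambda(q_t)_i = Pr(o_t, ..., o_T | q_t = i): backward-like recursion.
    lam = [[1.0] * n for _ in range(length)]
    for t in range(length - 1, -1, -1):
        for i in range(n):
            future = 1.0
            if t + 1 < length:
                future = sum(trans[i][j] * lam[t + 1][j] for j in range(n))
            lam[t][i] = emit[i][obs[t]] * future
    # pi(q_t)_i = Pr(q_t = i | o_1, ..., o_{t-1}): forward-like recursion.
    pi = [list(initial)]
    for t in range(1, length):
        prev = pi[-1]
        pi.append(normalize([
            sum(prev[i] * emit[i][obs[t - 1]] * trans[i][j] for i in range(n))
            for j in range(n)
        ]))
    # Belief is the normalized componentwise product of lambda and pi.
    return [normalize([l * p for l, p in zip(lam[t], pi[t])]) for t in range(length)]


initial = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1]
beliefs = state_beliefs(initial, trans, emit, obs)
print(beliefs)
```

The resulting beliefs coincide with the state occupancy probabilities that the forward-backward procedure would give for the same model.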

In contrast, the Inside-Outside algorithm, a grammatical inference algorithm proposed by Baker 
(1979), does not operate on Bayesian networks. Because of its similarity to the grammar inference 
algorithm described in this paper we will discuss this relationship in more detail in Section 6.11. 



4. Extensions to the basic belief propagation equations 

With regard to Section 6.2, we will now present a few extensions to the belief propagation equations. 
We will also provide the necessary derivations of all results. Since the equations quoted in the previous 
section are special cases of the results developed here, the proofs apply to the previous section as well. 

4.1. Dependence of daughter nodes

In the diagram we discussed in the introduction, the daughters of a given parent node were assumed 
to be conditionally independent (meaning independent once the value of the parent is known). 
Mathematically this is expressed for parent U with daughters V and X as 

$$ \Pr(V, X \mid U = u) = \Pr(V \mid U = u)\,\Pr(X \mid U = u), \tag{10} $$

for each possible u.

Fig. 6. This figure shows a node V with daughters $U_1^1, \ldots, U_{n_G}^G$. The daughter nodes are divided into G
groups as shown. The groups are regarded as statistically independent, but the nodes within each group are dependent. Hence the
conditional probability distribution of the daughters given the parent has the product form shown in Eq. (11).
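Whether a given joint table of two daughters satisfies Eq. (10) can be checked numerically. The sketch below is illustrative only: the tables, the tolerance and the function names are assumptions, and a table is stored as `joint[u][v][x] = Pr(V = v, X = x | U = u)`.

```python
def outer(p_v, p_x):
    """Build a conditionally independent slice: Pr(V, X | u) = Pr(V | u) Pr(X | u)."""
    return [[pv * px for px in p_x] for pv in p_v]


def satisfies_eq_10(joint, tol=1e-12):
    """True if every slice joint[u] factorizes into its two marginals."""
    for table in joint:
        p_v = [sum(row) for row in table]
        p_x = [sum(row[x] for row in table) for x in range(len(table[0]))]
        for v, row in enumerate(table):
            for x, p in enumerate(row):
                if abs(p - p_v[v] * p_x[x]) > tol:
                    return False
    return True


# Two hypothetical daughter pairs, each with a two-state parent U.
independent = [outer([0.9, 0.1], [0.8, 0.2]), outer([0.3, 0.7], [0.5, 0.5])]
dependent = [[[0.5, 0.0], [0.0, 0.5]], [[0.1, 0.4], [0.4, 0.1]]]

print(satisfies_eq_10(independent), satisfies_eq_10(dependent))
```

In the second table the daughters always agree (or mostly disagree) for a given parent state, so two separate link matrices could not represent it; such a pair would have to be placed in one group and described by a single rank-3 tensor.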



The belief propagation equations can be modified to apply in situations when Eq. (10) does not hold.
If there are dependency relations among the daughter nodes, it is not sufficient to specify each
parent-daughter relation separately; instead they must be specified jointly using a tensor of rank 3 or higher. This tensor
expresses the joint conditional probabilities between daughter nodes $U_1, \ldots, U_n$ and parent V, viz.
$\Pr(U_1, \ldots, U_n \mid V)$ instead of separate link matrices $\Pr(U_1 \mid V), \ldots, \Pr(U_n \mid V)$. In the most general case,
the daughters of V can be divided into, say, G groups. Nodes within one group are regarded as
statistically independent from the nodes of any other group, so there is no need to specify the
parent-daughter relation by one large tensor. However, the nodes within each group are considered to
be statistically dependent and their relation to the parent is expressed by a tensor for this group. Thus if
the nodes in group g are labeled $U_1^g, \ldots, U_{n_g}^g$ the overall conditional probability tensor has the form

$$ \Pr(U_1^1, \ldots, U_{n_G}^G \mid V) = \prod_{g=1}^{G} \Pr(U_1^g, \ldots, U_{n_g}^g \mid V). \tag{11} $$



Fig. 6 displays such a scenario graphically. 

The propagation equations can be adapted in a straightforward way to the new situation. For
convenience we define some additional notation. We denote the group consisting of nodes $U_1^g, \ldots, U_{n_g}^g$ by
$V^g$. Further we define the evidence $e_{V^g}^-$ as the combined evidence $e_{U_1^g}^-, \ldots, e_{U_{n_g}^g}^-$ and likewise
$e_{V^g}^+$ as the remaining evidence $e \setminus e_{V^g}^-$, i.e. the entire evidence e less $e_{V^g}^-$. Furthermore we define

$$ \lambda(V^g) = \Pr(e_{V^g}^- \mid V), \tag{12} $$
$$ \pi(V^g) = \Pr(V \mid e_{V^g}^+), \tag{13} $$

and we write $A^g$ to indicate the tensor for group g, i.e.

$$ A^g_{i j_1 \ldots j_{n_g}} = \Pr(U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g} \mid V = i). \tag{14} $$



4.2. Applicable laws of probability



In order to derive Eqs. (5) to (8) and their generalizations we require three laws of probability theory. 
These are described below.

4.2.1. Bayes' law

Bayes' law states that for events a, b, c we have

$$ \Pr(a \mid b, c) = \frac{\Pr(b \mid a, c)\,\Pr(a \mid c)}{\Pr(b \mid c)}. \tag{15} $$

Now, let U be a random variable corresponding to a node of the network, so that in our notation
Pr(U | b, c) is a vector. This vector must be normalized. We can then write Bayes' law in the following
form:

$$ \Pr(U \mid b, c) = \alpha\,\Pr(b \mid U, c)\,\Pr(U \mid c), \tag{16} $$

where α = 1/Pr(b | c) can be determined simply as a normalizing constant.

4.2.2. Conditioning 

Let a and b be events and U a random variable that can take a finite number of values 1, ..., n. The
following equation holds:

$$ \Pr(a \mid b) = \sum_{i=1}^{n} \Pr(a \mid U, b)_i \,\Pr(U \mid b)_i. \tag{17} $$
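A small numerical illustration of the vector form of Bayes' law and of the conditioning law follows. The three-state variable U and all numbers are hypothetical; the point of the sketch is that the normalizing constant of the posterior vector is exactly the reciprocal of the sum obtained by conditioning on U.

```python
# Hypothetical three-state node U; numbers chosen for illustration only.
prior_given_c = [0.5, 0.3, 0.2]     # Pr(U | c)
likelihood = [0.2, 0.5, 1.0]        # Pr(b | U, c), one entry per state of U

unnormalized = [l * p for l, p in zip(likelihood, prior_given_c)]

# Conditioning on U: Pr(b | c) = sum_i Pr(b | U = i, c) Pr(U = i | c).
prob_b_given_c = sum(unnormalized)

# Bayes' law in vector form: the posterior is the normalized product.
alpha = 1.0 / prob_b_given_c
posterior = [alpha * x for x in unnormalized]
print(prob_b_given_c, posterior)
```

In practice this means Pr(b | c) never has to be known in advance: it falls out of the normalization step.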



4.2.3. Separation 

To illustrate the law of separation consider the probability vector $\Pr(e_U^- \mid e_U^+, U)$. Since the graphical
structure is assumed to be a tree, the value of $e_U^+$ can affect the value of $e_U^-$ only via the value of U. We
say $e_U^+$ and $e_U^-$ are marginally independent, i.e. independent when conditioned on the value of U. Hence

$$ \Pr(e_U^- \mid e_U^+, U) = \Pr(e_U^- \mid U). \tag{18} $$

We also say "U separates $e_U^-$ from $e_U^+$".

4.3. Derivation of the propagation equations 

We will begin by proving a generalization of Eq. (5). The capital letters B, C, S and D on top of the
equal sign ( = ) indicate applications of Bayes' law, the laws of Conditioning and Separation and a
definition, respectively. A bracketed numeral indicates an application of an earlier equation. Furthermore,
for readability, we will identify a value $x_i$ of a random variable X with the index i itself, writing
Pr(X = i) instead of $\Pr(X = x_i)$. In fact, in Eq. (14), we already used this notation.

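Now, as for the λs we have

$$
\begin{aligned}
\lambda(V^g)_i &\overset{D}{=} \Pr(e_{V^g}^- \mid V = i) && (19)\\
&\overset{D}{=} \Pr(e_{U_1^g}^-, \ldots, e_{U_{n_g}^g}^- \mid V = i) && (20)\\
&\overset{C}{=} \sum_{j_1, \ldots, j_{n_g}} \Pr(e_{U_1^g}^-, \ldots, e_{U_{n_g}^g}^- \mid V = i, U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g})\, A^g_{i j_1 \ldots j_{n_g}} && (21)\\
&\overset{S}{=} \sum_{j_1, \ldots, j_{n_g}} \prod_{k=1}^{n_g} \Pr(e_{U_k^g}^- \mid U_k^g = j_k)\, A^g_{i j_1 \ldots j_{n_g}} && (22)\\
&\overset{D}{=} \sum_{j_1, \ldots, j_{n_g}} \prod_{k=1}^{n_g} \lambda(U_k^g)_{j_k}\, A^g_{i j_1 \ldots j_{n_g}}, && (23)
\end{aligned}
$$

$$
\begin{aligned}
\lambda(V)_i &\overset{D}{=} \Pr(e_V^- \mid V = i) && (24)\\
&\overset{D}{=} \Pr(e_{V^1}^-, \ldots, e_{V^G}^- \mid V = i) && (25)\\
&\overset{S}{=} \prod_{g=1}^{G} \Pr(e_{V^g}^- \mid V = i) && (26)\\
&\overset{D}{=} \prod_{g=1}^{G} \lambda(V^g)_i. && (27)
\end{aligned}
$$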

Eq. (23) expresses the vector $\lambda(V^g)$ as a simple tensor-vector product between the tensor $A^g$ and the
λ-vectors $\lambda(U_k^g)$, $k = 1, \ldots, n_g$. Regarding Fig. 6 one could say that the vectors $\lambda(U_k^g)$ enter the triangle
(tensor) from the bottom. Here the tensor-vector product is formed and the λ vector representing
the group, $\lambda(V^g)$, is emitted at the top. Eq. (27) shows how the λ vectors from the various groups need to
be combined to give the overall λ vector: by component-wise vector product. One could write these two
results in the vector form:

$$ \lambda(V^g) = A^g\,\lambda(U_1^g) \cdots \lambda(U_{n_g}^g), \tag{28} $$
$$ \lambda(V) = \prod_{g=1}^{G} \lambda(V^g), \tag{29} $$

where the multiplication in Eq. (28) is a tensor-vector product (like a generalized matrix-vector product)
and the multiplication in Eq. (29) is a componentwise vector product. However, since it is difficult to
distinguish between the two forms of multiplication in this notation and it is also not easy to see which
indices of the tensor $A^g$ identify with which λ vector in Eq. (28), we will always use the more explicit
component-wise description shown in Eqs. (23) and (27).
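The component-wise description of Eqs. (23) and (27) translates directly into nested sums. The sketch below assumes a hypothetical parent V with two groups: one group of two dependent daughters described by a rank-3 tensor, and one group consisting of a single daughter described by an ordinary link matrix. Tensors are stored as nested lists with the parent index first; all names and numbers are illustrative.

```python
from itertools import product


def group_lambda(tensor, child_lambdas):
    """Eq. (23): lambda(V^g)_i = sum_{j_1..j_n} prod_k lambda(U_k)_{j_k} * A[i][j_1]...[j_n]."""
    sizes = [len(lam) for lam in child_lambdas]
    result = []
    for i in range(len(tensor)):
        total = 0.0
        for js in product(*(range(s) for s in sizes)):
            weight = 1.0
            for lam, j in zip(child_lambdas, js):
                weight *= lam[j]
            entry = tensor[i]
            for j in js:                     # walk down to A[i][j_1]...[j_n]
                entry = entry[j]
            total += weight * entry
        result.append(total)
    return result


def combine_groups(group_lambdas):
    """Eq. (27): componentwise product of the group lambda vectors."""
    out = [1.0] * len(group_lambdas[0])
    for lam in group_lambdas:
        out = [a * b for a, b in zip(out, lam)]
    return out


# Group 1: two dependent daughters, a1[i][j1][j2] = Pr(U1 = j1, U2 = j2 | V = i).
a1 = [[[0.4, 0.1], [0.1, 0.4]],
      [[0.1, 0.4], [0.4, 0.1]]]
# Group 2: a single daughter, a2[i][j] = Pr(U3 = j | V = i).
a2 = [[0.7, 0.3],
      [0.2, 0.8]]

lam_group1 = group_lambda(a1, [[1.0, 0.0], [0.0, 1.0]])   # U1 = 0 and U2 = 1 observed
lam_group2 = group_lambda(a2, [[0.0, 1.0]])               # U3 = 1 observed
lam_v = combine_groups([lam_group1, lam_group2])
print(lam_group1, lam_group2, lam_v)
```

With one daughter per group, `group_lambda` reduces to the ordinary vector-matrix product of the basic theory.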
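For the πs we derive

$$
\begin{aligned}
\pi(V^g)_i &\overset{D}{=} \Pr(V = i \mid e_{V^g}^+) && (30)\\
&\overset{D}{=} \Pr(V = i \mid e_V^+, e_{V^1}^-, \ldots, \not e_{V^g}^-, \ldots, e_{V^G}^-) && (31)\\
&\overset{B}{=} \alpha \Pr(V = i, e_{V^1}^-, \ldots, \not e_{V^g}^-, \ldots, e_{V^G}^- \mid e_V^+) && (32)\\
&= \alpha \Pr(V = i \mid e_V^+)\, \Pr(e_{V^1}^-, \ldots, \not e_{V^g}^-, \ldots, e_{V^G}^- \mid e_V^+, V = i) && (33)\\
&\overset{S}{=} \alpha \Pr(V = i \mid e_V^+) \prod_{g' = 1 \ldots \not g \ldots G} \Pr(e_{V^{g'}}^- \mid V = i) && (34)\\
&\overset{D}{=} \alpha\, \pi(V)_i \prod_{g' = 1 \ldots \not g \ldots G} \lambda(V^{g'})_i, && (35)
\end{aligned}
$$

$$
\begin{aligned}
\pi(U_k^g)_{j_k} &\overset{D}{=} \Pr(U_k^g = j_k \mid e_{U_k^g}^+) && (36)\\
&\overset{D}{=} \Pr\Bigl(U_k^g = j_k \Bigm| e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^-\Bigr) && (37)\\
&\overset{C}{=} \sum_{i,\, j_{1 \ldots \not k \ldots n_g}} \Bigl[ \Pr\Bigl(U_k^g = j_k \Bigm| V = i, \bigwedge_{k' \neq k} U_{k'}^g = j_{k'}, e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^-\Bigr) \\
&\qquad\qquad \times \Pr\Bigl(V = i, \bigwedge_{k' \neq k} U_{k'}^g = j_{k'} \Bigm| e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^-\Bigr) \Bigr] && (38)\\
&\overset{S,B}{=} \alpha_1 \sum_{i,\, j_{1 \ldots \not k \ldots n_g}} \Bigl[ \Pr\Bigl(U_k^g = j_k \Bigm| V = i, \bigwedge_{k' \neq k} U_{k'}^g = j_{k'}\Bigr) \Pr(V = i) \\
&\qquad\qquad \times \Pr\Bigl(\bigwedge_{k' \neq k} U_{k'}^g = j_{k'}, e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^- \Bigm| V = i\Bigr) \Bigr] && (39)\\
&\overset{B}{=} \alpha_1 \sum_{i,\, j_{1 \ldots \not k \ldots n_g}} \Bigl[ \frac{\Pr\bigl(\bigwedge_{k'} U_{k'}^g = j_{k'} \bigm| V = i\bigr)}{\Pr\bigl(\bigwedge_{k' \neq k} U_{k'}^g = j_{k'} \bigm| V = i\bigr)} \Pr(V = i) \\
&\qquad\qquad \times \Pr\Bigl(\bigwedge_{k' \neq k} U_{k'}^g = j_{k'}, e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^- \Bigm| V = i\Bigr) \Bigr] && (40)\\
&\overset{B}{=} \alpha_1 \sum_{i,\, j_{1 \ldots \not k \ldots n_g}} A^g_{i j_1 \ldots j_{n_g}} \Pr(V = i)\, \Pr\Bigl(e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^- \Bigm| V = i, \bigwedge_{k' \neq k} U_{k'}^g = j_{k'}\Bigr) && (41)\\
&\overset{S}{=} \alpha_1 \sum_{i,\, j_{1 \ldots \not k \ldots n_g}} A^g_{i j_1 \ldots j_{n_g}} \Pr(V = i)\, \Pr(e_{V^g}^+ \mid V = i) \prod_{k' = 1 \ldots \not k \ldots n_g} \Pr(e_{U_{k'}^g}^- \mid U_{k'}^g = j_{k'}) && (42)\\
&\overset{B,D}{=} \alpha_2 \sum_{i,\, j_{1 \ldots \not k \ldots n_g}} A^g_{i j_1 \ldots j_{n_g}}\, \pi(V^g)_i \prod_{k' = 1 \ldots \not k \ldots n_g} \lambda(U_{k'}^g)_{j_{k'}}. && (43)
\end{aligned}
$$

Here $1 \ldots \not k \ldots n_g$ stands for the integers from 1 to $n_g$ with the exception of k. The index k'
generally ranges over these integers except in the numerator of Eq. (40) where it covers the full range
$1, \ldots, n_g$. The constants $\alpha_1$ and $\alpha_2$ are given by

$$ \alpha_1 = \frac{1}{\Pr\bigl(e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^-\bigr)} \quad\text{and}\quad \alpha_2 = \frac{\Pr(e_{V^g}^+)}{\Pr\bigl(e_{V^g}^+, \bigwedge_{k' \neq k} e_{U_{k'}^g}^-\bigr)}, \tag{44} $$

but since these constants are independent of $j_k$ and since we know that the vector $\pi(U_k^g)$ is normalized,
they can simply be determined as normalizing constants.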

Thus it is clear that even in the presence of statistical dependence among the daughter nodes, the λ
and π vectors can be propagated in a fairly straightforward fashion.
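The downward propagation into a dependent group, Eq. (43), can be sketched in the same style. The example assumes a hypothetical parent V with a single group of two dependent daughters, so that π(V^g) equals the prior π(V); the tensor, the prior and the function name `pi_of_daughter` are illustrative assumptions.

```python
from itertools import product


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def pi_of_daughter(tensor, pi_group, child_lambdas, k):
    """Eq. (43): pi(U_k)_{j_k} is proportional to
    sum over i and all j_k' (k' != k) of A[i][j_1]...[j_n] * pi(V^g)_i * prod_{k' != k} lambda(U_k')_{j_k'}.
    """
    sizes = [len(lam) for lam in child_lambdas]
    out = [0.0] * sizes[k]
    for i, p_i in enumerate(pi_group):
        for js in product(*(range(s) for s in sizes)):
            weight = p_i
            for other, (lam, j) in enumerate(zip(child_lambdas, js)):
                if other != k:               # the lambda of U_k itself is left out
                    weight *= lam[j]
            entry = tensor[i]
            for j in js:
                entry = entry[j]
            out[js[k]] += weight * entry
    return normalize(out)                    # alpha_2 is fixed by normalization


# a[i][j1][j2] = Pr(U1 = j1, U2 = j2 | V = i); the daughters are dependent.
a = [[[0.4, 0.1], [0.1, 0.4]],
     [[0.1, 0.4], [0.4, 0.1]]]
pi_v = [0.8, 0.2]                            # single group, so pi(V^g) = pi(V)
lam_u1 = [1.0, 1.0]                          # U1 carries no evidence of its own
lam_u2 = [0.0, 1.0]                          # sibling U2 observed in state 1

pi_u1 = pi_of_daughter(a, pi_v, [lam_u1, lam_u2], k=0)
print(pi_u1)
```

Unlike in the independent case, the evidence on the sibling U2 reaches U1 through the shared tensor rather than through a separate message via V.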

4.4. Learning the conditional probabilities 

We will now turn to the problem of learning the conditional probabilities stored in the matrices from
training data. A training sample is an instantiation of the observed nodes of a network. The training
problem is that of choosing the link matrices such that the overall probability of the training material (i.e.
the product of probabilities of each sample given the link matrices) is maximized. We will distinguish two
cases: the simple case in which all nodes are observed and the more complicated case involving matrices
between unobserved nodes.

4.4.1. Case 1: all nodes are observed

Suppose we wish to find the link matrix Pr(Y | X) between two nodes X and Y. In this case we simply
instantiate a matrix of counters $(C(X,Y)_{ij})$ which counts the number of times node X is in state i and
node Y is in state j. After processing the entire training data and updating the respective counters, the
maximum likelihood estimate of the matrix Pr(Y | X) is then given by

$$ \Pr(Y = y_j \mid X = x_i) = \frac{C(X,Y)_{ij}}{\sum_{j'} C(X,Y)_{ij'}}. \tag{45} $$
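The counting estimate of Eq. (45) is short enough to write out in full. In this sketch the training samples are hypothetical (x, y) state pairs, and the handling of parent states that never occur in the data (a uniform row) is an assumption added for illustration; the text does not specify it.

```python
def estimate_link_matrix(samples, n_x, n_y):
    """Maximum likelihood estimate of Pr(Y | X) from fully observed (x, y) pairs."""
    counts = [[0] * n_y for _ in range(n_x)]
    for x, y in samples:
        counts[x][y] += 1                     # C(X, Y)_{ij}
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a parent state that was never observed gets a uniform row.
            matrix.append([1.0 / n_y] * n_y)
        else:
            matrix.append([c / total for c in row])
    return matrix


samples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1)]
link = estimate_link_matrix(samples, n_x=2, n_y=2)
print(link)
```

Each row is normalized separately because the estimate is conditional on the parent state.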



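Now consider the slightly more complicated case in which a node V has n daughters structured in G
groups, $U_1^1, U_2^1, \ldots, U_{n_1}^1, U_1^2, \ldots, U_{n_2}^2, \ldots, U_{n_G}^G$, that was discussed in Section 4.1. Here the relationship
between the parent V and the daughters $U_1^g, U_2^g, \ldots, U_{n_g}^g$ in group g is expressed in the tensor shown in
Eq. (14). If the nodes $V, U_1^g, U_2^g, \ldots, U_{n_g}^g$ are all observed we can again simply instantiate a tensor of
counters $C(V, U_1^g, \ldots, U_{n_g}^g)$ and count the number of occurrences of each event over the training
data. The maximum likelihood estimate for $A^g$ is then

$$ A^g_{i j_1 \ldots j_{n_g}} = \frac{C(V, U_1^g, \ldots, U_{n_g}^g)_{i j_1 \ldots j_{n_g}}}{\sum_{j'_1, \ldots, j'_{n_g}} C(V, U_1^g, \ldots, U_{n_g}^g)_{i j'_1 \ldots j'_{n_g}}}. \tag{46} $$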



4.4.2. Case 2: some nodes are unobserved 

If the nodes involved are unobserved, it is no longer possible to count simultaneous occurrences in a
straightforward way. However, we can ask for the expected number of simultaneous occurrences. For a
single sample e this expected number equals the probability

$$ \Pr(V = i, U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g} \mid e). \tag{47} $$



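Here again, we split the daughter nodes into different groups according to their dependencies and
only consider one group as it is independent of all the others. Fortunately, there is an easy way to
calculate the quantity in Eq. (47) using intermediate results from the belief propagation. We have

$$
\begin{aligned}
&\Pr(V = i, U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g} \mid e) && (48)\\
&\quad\overset{B}{=} \Pr(V = i \mid e)\, \Pr(U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g} \mid V = i, e) && (49)\\
&\quad\overset{(9),S}{=} \mathrm{BEL}(V)_i\, \Pr(U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g} \mid V = i, e_{V^g}^-) && (50)\\
&\quad\overset{B}{=} \mathrm{BEL}(V)_i\, \frac{\Pr(e_{V^g}^- \mid V = i, U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g})\, \Pr(U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g} \mid V = i)}{\Pr(e_{V^g}^- \mid V = i)} && (51)\\
&\quad\overset{(27),D}{=} \alpha \prod_{g' = 1 \ldots \not g \ldots G} \lambda(V^{g'})_i\, \pi(V)_i\, \Pr(e_{U_1^g}^-, \ldots, e_{U_{n_g}^g}^- \mid U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g})\, A^g_{i j_1 \ldots j_{n_g}} && (52)\\
&\quad\overset{S}{=} \alpha\, \pi(V^g)_i \prod_{k=1}^{n_g} \Pr(e_{U_k^g}^- \mid U_k^g = j_k)\, A^g_{i j_1 \ldots j_{n_g}} && (53)\\
&\quad\overset{D}{=} \alpha\, \pi(V^g)_i \prod_{k=1}^{n_g} \lambda(U_k^g)_{j_k}\, A^g_{i j_1 \ldots j_{n_g}}. && (54)
\end{aligned}
$$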

Instead of counting joint occurrences, we now add the expected number of joint occurrences in the
tensor $C(V, U_1^g, \ldots, U_{n_g}^g)$. Hence after processing all samples e in the data we obtain

$$ C(V, U_1^g, \ldots, U_{n_g}^g) = \sum_e \Pr(V, U_1^g, \ldots, U_{n_g}^g \mid e), \tag{55} $$

where the sum denotes tensor addition. An estimate for the conditional probability tensor $A^g$ can now
be obtained using Eq. (46). It should be noted that since Eq. (47) represents the expected number of joint
occurrences for one sample, Eq. (55) represents the expected number of joint occurrences over the entire
training data.

Note however that a previous estimate of $A^g$ is required in Eq. (54). Hence, unlike the "all nodes
observed" case the tensors can only be learned iteratively. We therefore have the following learning
algorithm:

1. Choose initial (perhaps random) connection tensors.

2. Process the training data once, accumulating the tensor $(\Pr(V = i, U_1^g = j_1, \ldots, U_{n_g}^g = j_{n_g} \mid e))$ in a
"counting" tensor $C(V, U_1^g, \ldots, U_{n_g}^g)$, for each parameter tensor in the network.

3. Normalize the C tensors and obtain the new estimates for the A tensors, viz.

$$ A^g_{i j_1 \ldots j_{n_g}} = \frac{C(V, U_1^g, \ldots, U_{n_g}^g)_{i j_1 \ldots j_{n_g}}}{\sum_{j'_1, \ldots, j'_{n_g}} C(V, U_1^g, \ldots, U_{n_g}^g)_{i j'_1 \ldots j'_{n_g}}}. \tag{56} $$

4. Go back to step 2 until convergence is achieved.

We will show in the next section that re-estimating the parameters in this way is guaranteed to
increase the overall likelihood of the training data. Hence this iterative technique converges to a "local
maximum likelihood" estimator. As in the Baum-Welch algorithm, convergence to the true maximum
likelihood estimator is not guaranteed.

The entropy calculated over the training data can be used as an indicator to decide when convergence
is achieved. It is advisable to smooth this over several iterations and discontinue training when the
smoothed entropy no longer decreases significantly.
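The four steps of the learning algorithm can be sketched for the smallest interesting case: a hypothetical hidden root V with a fixed prior and a single group of two observed daughters. The training pairs, the seed, the fixed number of iterations and all function names are assumptions made for illustration. Because both daughters are observed with certainty, the expected joint occurrences for one sample reduce to the posterior of V placed at the observed cell of the tensor.

```python
import math
import random


def random_tensor(rng, n_v, n_1, n_2):
    """Step 1: random initial tensor a[i][j1][j2] = Pr(U1 = j1, U2 = j2 | V = i)."""
    tensor = []
    for _ in range(n_v):
        raw = [[rng.random() + 0.1 for _ in range(n_2)] for _ in range(n_1)]
        total = sum(sum(row) for row in raw)
        tensor.append([[x / total for x in row] for row in raw])
    return tensor


def em_step(prior, tensor, samples):
    """Steps 2 and 3: accumulate expected counts, then renormalize per parent state."""
    n_v, n_1, n_2 = len(tensor), len(tensor[0]), len(tensor[0][0])
    counts = [[[0.0] * n_2 for _ in range(n_1)] for _ in range(n_v)]
    log_likelihood = 0.0
    for j1, j2 in samples:
        joint = [prior[i] * tensor[i][j1][j2] for i in range(n_v)]
        evidence = sum(joint)                 # probability of this sample
        log_likelihood += math.log(evidence)
        for i in range(n_v):
            counts[i][j1][j2] += joint[i] / evidence   # expected joint occurrence
    new_tensor = []
    for i in range(n_v):
        total = sum(sum(row) for row in counts[i])
        new_tensor.append([[c / total for c in row] for row in counts[i]])
    return new_tensor, log_likelihood


def train(prior, tensor, samples, iterations=20):
    """Step 4: repeat; history[n] is the log-likelihood before the n-th update."""
    history = []
    for _ in range(iterations):
        tensor, log_likelihood = em_step(prior, tensor, samples)
        history.append(log_likelihood)
    return tensor, history


prior = [0.5, 0.5]                            # assumed fixed, not re-estimated here
samples = [(0, 0)] * 4 + [(1, 1)] * 4 + [(0, 1), (1, 0)]
trained, history = train(prior, random_tensor(random.Random(0), 2, 2, 2), samples)
print(history[0], history[-1])
```

The recorded log-likelihoods never decrease from one iteration to the next, which is the behaviour the convergence theorem of the next section guarantees; dividing the negative log-likelihood by the number of samples gives a training-set entropy that could serve as the stopping indicator.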



5. A theorem about convergence of the iterative training method 

In this section we will show that the previously described method for updating the probability 
estimates of the link tensors always improves the overall likelihood of the training data. In order to state 
the theorem we need to do some preliminary work. 

Fig. 7 shows three causal trees. The nodes of the trees are represented as circles and squares. The
squares represent evidence nodes, i.e. nodes for which the state can be observed. The circles represent
unobserved nodes. The triangles represent connection tensors. As can be seen, the daughters of a given
parent are sometimes assumed to be marginally independent (when each link has its own tensor) and
sometimes exhibit dependencies (when certain links share tensors). The trees are said to model the
observed nodes. The parameters of the model consist of the components of each of the connection
tensors as well as the prior vectors of the root nodes ($\pi_1$, $\pi_2$ and $\pi_3$ in Fig. 7).

Fig. 7. Three isolated trees (Tree 1, Tree 2, Tree 3) spanning part of the observation sequence.

For simplicity we assume here that the evidence nodes coincide with the leaf nodes of the tree. The
theorem that follows does not require this, but it makes the discussion easier. Evidence nodes are
instantiated by providing their $\lambda$ vector. In the simple (deterministic) case this vector has the form
$(0, \ldots, 0, 1, 0, \ldots, 0)$, indicating by the position of the 1 the value of that node. It may however be any
vector of non-negative real numbers. We could for example use the vector of word likelihood values
calculated by a speech recognizer for each word of the recognizer's vocabulary.

An instantiation of the evidence nodes is called a sample. For a given sample, we can calculate the
probability of this sample given the model parameters. To do this recall that the $\lambda$ and $\pi$ vectors at a
node $X$ were defined as

$$\lambda(X) = \Pr(e_X^- \mid X), \tag{57}$$

$$\pi(X) = \Pr(X \mid e_X^+). \tag{58}$$

Thus

$$\lambda(X) \cdot \pi(X) = \sum_i \Pr(e_X^- \mid X = i)\,\Pr(X = i \mid e_X^+) \tag{59}$$

$$= \sum_i \Pr(e_X^- \mid X = i, e_X^+)\,\Pr(X = i \mid e_X^+) \tag{60}$$

$$= \Pr(e_X^- \mid e_X^+). \tag{61}$$

Eq. (60) is justified because the node $X$ separates $e_X^-$ from $e_X^+$, so the probability of $e_X^-$ depends only
on the state of $X$. At the root node of a tree, $e_X^+ = \emptyset$, so $\lambda \cdot \pi = \Pr(e)$, where $e$ stands for the part of the
evidence spanned by the tree. When there are multiple trees as in Fig. 7, the fact that these are not
connected implies that they span independent parts of the sample. Thus the overall probability of the
sample is obtained by multiplying the terms $\lambda(R) \cdot \pi(R)$ over all root nodes $R$.

It is usually more convenient, however, to use the negative logarithm of this quantity. This is known as
the entropy of the sample. We define the entropy of a tree $T$ with root node $R$ as

$$E(T) = -\log(\lambda(R) \cdot \pi(R)), \tag{62}$$

and the entropy of the sample as

$$E = -\sum_{\text{root nodes } R} \log(\lambda(R) \cdot \pi(R)). \tag{63}$$
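As a small illustration, the two entropies can be computed from the $\lambda$ and $\pi$ vectors at the root nodes as sketched here. The natural logarithm is an assumption of this sketch (the base only changes the unit), and the function names are hypothetical.

```python
import math

def tree_entropy(lam_root, pi_root):
    """Entropy of one tree: -log of the dot product lambda(R) . pi(R)."""
    likelihood = sum(l * p for l, p in zip(lam_root, pi_root))
    return -math.log(likelihood)

def sample_entropy(roots):
    """Entropy of the sample: sum of the tree entropies over all root nodes.

    roots: list of (lambda vector, pi vector) pairs, one per root node.
    """
    return sum(tree_entropy(lam, pi) for lam, pi in roots)

# Two isolated trees with likelihoods 0.25 and 1.0.
roots = [([0.5, 0.0], [0.5, 0.5]), ([1.0, 1.0], [0.25, 0.75])]
total = sample_entropy(roots)
```

Because the trees span independent parts of the sample, summing the tree entropies is the same as taking the negative logarithm of the product of the per-tree likelihoods.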
Fig. 8. Combination of the three isolated trees of Fig. 7 (Tree 1, Tree 2, Tree 3) into a single one under a 1-state root node.

5.1. Parameterization of the model

Fig. 8 shows how the isolated trees of Fig. 7 can be converted into a single equivalent tree. This is
done by introducing a new root node $R$ and connecting the previous root nodes as daughters to this
node. The new root node represents a random variable that can only take one value (thus it is not
random at all). The components of the connection matrix connecting $R$ to one of the old root nodes $R_i$
are set identical to the components of the prior $\pi_i$ of $R_i$ in Fig. 7 (since $R$ can only take one value the
connection matrix is really just a vector). We do not need to consider the prior of $R$, for since $R$ only
takes one value this prior is always equal to the unit vector (1).

It should thus be clear that Fig. 8 is equivalent to Fig. 7. The advantage of this transformation is that 
we only need to consider one tree and that prior vectors are represented as connection tensors, so a 
re-estimation formula for the connection tensors will automatically apply to the priors as well. 

The total set of parameters to be estimated thus consists of the components of the various connection 
tensors in Fig. 8. 

Weight linkage. Since the latter part of this paper makes extensive use of a concept known as weight
linkage we will introduce this concept here. We may impose additional constraints on a model by forcing
certain connection tensors to have identical components. Thus if two tensors $A^x$ and $A^y$ of the model
have the same rank (number of indices) and corresponding indices have the same dimension we can
impose the additional constraint:

$$A^x_{i j_1 \ldots j_n} = A^y_{i j_1 \ldots j_n} \quad \text{for all } i, j_1, j_2, \ldots, j_n. \tag{64}$$

This reduces the number of free parameters of the model.

In the following we will consider different models which, although having the same geometrical
structure (or topology), differ in the actual parameter values. To differentiate between two such models
we will use the letter $\Theta$ to indicate one parameter set and $\bar{\Theta}$ to indicate another. Thus the probability of
a sample $e$ under model $\Theta$ will be written as $\Pr(e \mid \Theta)$ and that of $e$ under model $\bar{\Theta}$ will be written as
$\Pr(e \mid \bar{\Theta})$. More specifically, $\Theta$ is defined as the set of all connection tensors of the model and we will
write $A \in \Theta$ to indicate that a given tensor $A$ is part of model $\Theta$. Naturally $\Theta$ can be partitioned into a
set of subsets

$$\Theta = \bigcup_h \Theta_h, \tag{65}$$

where each subset $\Theta_h$ contains all tensors that are linked. (Linking of tensors really defines an
equivalence relation on $\Theta$ and the $\Theta_h$ are the equivalence classes.) In slight abuse of notation we will also
write

$$A(V, U_1, \ldots, U_n) \in \Theta_h \tag{66}$$

to indicate that the tensor $A$ which links the parent node $V$ to the daughter nodes $U_1$ to $U_n$ belongs to the
tensors in $\Theta_h$.

104 H. Lucke / Speech Communication 16 (1995) 89-118

We can now state the main result of this section. 

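Theorem 1. Let $\Theta$ be a parameter set for a model describing samples $e$ of a training data set Train. Let $E_\Theta$
be the entropy of the training data given the parameter set $\Theta$. Then if we choose a new parameter set $\bar{\Theta}$ for
the same model which decomposes equally into subsets $\bar{\Theta}_h$ of linked tensors such that for any tensor $\bar{A} \in \bar{\Theta}_h$
we have

$$\bar{A}_{i j_1 \ldots j_n} = \frac{C^h_{i j_1 \ldots j_n}}{\sum_{j_1', \ldots, j_n'} C^h_{i j_1' \ldots j_n'}}, \tag{67}$$

where

$$C^h_{i j_1 \ldots j_n} = \sum_{e \in \text{Train}} \; \sum_{A(V, U_1, \ldots, U_n) \in \Theta_h} \Pr\Bigl(V = i, \bigwedge_{k=1}^{n} U_k = j_k \Bigm| e, \Theta\Bigr), \tag{68}$$

and denote the entropy of the training data given $\bar{\Theta}$ by $E_{\bar{\Theta}}$, we have

$$E_{\bar{\Theta}} \le E_\Theta. \tag{69}$$

The proof of this theorem is given in Appendix A.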

The usefulness of Theorem 1 should be clear. The partial contributions

$$\Pr\Bigl(V = i, \bigwedge_{k=1}^{n} U_k = j_k \Bigm| e, \Theta\Bigr) \tag{70}$$

to the $C^h$ tensor can readily be calculated using Eq. (54). After processing the training data once, one
merely needs to re-normalize the $C^h$ tensors to obtain a new estimate of the model parameters. The
theorem guarantees that the new set of parameters models the training data at least as well as the previous
one.
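The following toy sketch illustrates the re-estimation step with linked tensors and the entropy guarantee of Theorem 1. The model is an assumption made for illustration only: a hidden root $V$ with prior `P` and two observed daughters that share one connection matrix `B` (so the two links are linked in the sense of weight linkage). It is not the language model of the later sections.

```python
import math

def training_entropy(data, P, B):
    """Negative log-likelihood of all samples (u1, u2) under P and shared B."""
    return -sum(
        math.log(sum(P[i] * B[i][u1] * B[i][u2] for i in range(len(P))))
        for u1, u2 in data
    )

def reestimate(data, P, B):
    """One pass over the data: pool posterior counts, then re-normalize."""
    K, M = len(P), len(B[0])
    count_P = [0.0] * K
    count_B = [[0.0] * M for _ in range(K)]   # one count tensor for both links
    for u1, u2 in data:
        joint = [P[i] * B[i][u1] * B[i][u2] for i in range(K)]
        total = sum(joint)
        for i in range(K):
            q = joint[i] / total              # posterior Pr(V = i | sample)
            count_P[i] += q
            count_B[i][u1] += q               # contribution of the first link
            count_B[i][u2] += q               # contribution of the linked second link
    new_P = [c / sum(count_P) for c in count_P]
    new_B = [[c / sum(row) for c in row] for row in count_B]
    return new_P, new_B

data = [(0, 0), (0, 1), (1, 1), (2, 2), (2, 2), (0, 0)]
P, B = [0.6, 0.4], [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
history = [training_entropy(data, P, B)]
for _ in range(5):
    P, B = reestimate(data, P, B)
    history.append(training_entropy(data, P, B))
```

The list `history` holds the training entropy before and after each update; by the theorem it should never increase from one entry to the next.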



6. An application: learning the hidden structure of language 

The previous sections have been quite general in describing how probabilistic inferences can be made 
given a certain model (i.e. a graph) and also how the parameterization of the model can be learned even 
if some nodes are not observed. We will now turn to a specific application in which the developed ideas 
will be used. 

The particular problem studied is that of language modeling, i.e. providing a probabilistic model that
assigns probabilities to sequences of words which correspond to their relative frequencies. Such models are
useful in speech recognition where they assist the speech recognizer by favoring likely word sequences
over unlikely ones.

The specific model adopted could be regarded as lying somewhere between the popular n-gram
language models and a stochastic context-free grammar. The grammatical information is stored in the
form of stochastic re-write (or production) rules, i.e. in rules of the form

$$i \rightarrow \alpha \quad (p), \tag{71}$$

where $i$ is a non-terminal symbol, $\alpha$ is a string of non-terminal and terminal symbols and $p$ is the
probability of this rule being used given that $i$ is a symbol to be rewritten. However, unlike a genuine
context-free grammar, the observation sequence is not divided into sentences prior to parsing. Instead,
the algorithm finds its own segmentation of the observation sequence and builds parse trees separately
for each segment. When applied to natural language such segments are usually smaller than sentences,
but they may also be as large or even larger than the actual sentences. In short, the model is ignorant of
sentence boundaries, such as would be denoted by full stops in the observation sequence.

Fig. 9. This diagram illustrates the parsing algorithm utilizing the partial backtracking technique. When a new symbol is read all
(newly) possible parse trees are constructed bottom up. This defines a lattice of trees. The possible paths through the lattice are
traced and pruned such that only the best (i.e. minimal entropy) path for each end point is kept. At some earlier point (relative to
the symbol just read) all these paths converge, identifying a unique sequence of trees. These trees are then chosen for the
propagation and re-estimation algorithm.

In the model rule productions such as (71) are viewed as causal relationships, the left-hand side 
causing the right-hand side. In this framework we will see that the ideas developed in the previous 
sections are directly applicable and allow us to not only parse the observation sequence but also learn the 
production rule probabilities from example text. 

6.1. Overview of the algorithm

To accomplish our task the model has to solve two problems:

1. The selection of the best segmentation points and the structure of the parse tree (the structure
problem).

2. The calculation of the probabilities of the various terminal and non-terminal symbols in the parse
trees (the assignment problem).

It is clear that if the structure problem was somehow solved for us, we could use the theory of
Bayesian trees developed earlier to solve the assignment problem by a suitable propagation of $\lambda$ and $\pi$
vectors. The re-estimation of grammar parameters could similarly be performed at the same time. Since,
however, we are not given the answer to the structure problem we need to solve both problems
simultaneously. An outline of this procedure will now be discussed.

Fig. 9 shows a partially parsed observation sequence. Symbols are processed one by one, left to right, as
they come in. Over the observation sequence a number of parse trees are built (and later pruned). Some
of these are shown in the right-hand part of Fig. 9. These trees form a lattice over the observation
sequence and it remains to pick the best sequence of trees, i.e. the one that maximizes the likelihood of
the observation sequence. This could be solved by dynamic programming if we were given definite start
and end points of this lattice. In our problem there are no such endpoints, as we are processing an
infinite string of words. If, however, we impose a maximum on the size of the trees,¹ we can select trees
among the best path after only processing a finite amount of the data using a technique known as partial
backtracking (Brown et al., 1982).

1 By the size of a parse tree we mean the number of terminal symbols it spans. 
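To make the role of the maximum tree size concrete, the sketch below solves the simpler finite case mentioned in the text: plain dynamic programming over a sequence with definite start and end points. It is not the partial backtracking technique itself. The callback `span_entropy(start, end)`, which returns the entropy of the best tree spanning symbols `start` to `end - 1`, is a hypothetical stand-in for the tree lattice.

```python
def best_segmentation(n, span_entropy, max_size):
    """Minimal-entropy split of symbols 0..n-1 into consecutive trees."""
    best = [0.0] + [float("inf")] * n   # best[t]: minimal entropy of the first t symbols
    back = [0] * (n + 1)                # back[t]: start of the last tree in that prefix
    for end in range(1, n + 1):
        for start in range(max(0, end - max_size), end):
            cost = best[start] + span_entropy(start, end)
            if cost < best[end]:
                best[end], back[end] = cost, start
    # Trace the chosen trees back from the end of the sequence.
    segments, end = [], n
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    return best[n], segments[::-1]

# Toy cost: trees of size 2 are cheap, all other sizes cost 1.5 per symbol.
toy_cost = lambda s, e: 1.0 if e - s == 2 else 1.5 * (e - s)
total, segments = best_segmentation(5, toy_cost, max_size=3)
```

Because no tree may span more than `max_size` symbols, each end point only has to look back a bounded distance, which is what makes on-line selection of trees possible.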
6.2. A specific grammar model

To solve the structure problem we need to essentially search through the space of all possible parse 
trees and select the most suitable. In order to keep the search space as small as possible we will require 
the grammar to be in Chomsky Normal Form.²

It is well known that any context-free grammar can be written in Chomsky Normal Form (by increasing
the number of non-terminal symbols if necessary), so the Chomsky normality constraint does not impose
any structural constraints on the language, but only on the parse trees, by requiring these to be essentially
binary, each node having two daughters except for the so-called pre-terminal nodes, which have only one.
It also allows us to represent the grammar by the following quantities: a tensor of rank 3, a matrix and a
vector. We write

$A_{ijk}$: the probability of non-terminal symbol $i$ re-writing to the two non-terminals $j$ and $k$,
$B_{im}$: the probability of the non-terminal $i$ re-writing to the terminal symbol $m$, and
$P_i$: the prior probability of the first symbol in a derivation (the root).

These three quantities must satisfy the stochastic constraints

$$\sum_{jk} A_{ijk} + \sum_m B_{im} = 1 \quad \text{for all } i \tag{72}$$

and

$$\sum_i P_i = 1. \tag{73}$$

For a given parse tree, we can now apply the belief propagation algorithm. The tensors $A$ and $B$ serve
as connection tensors and $P$ provides the prior to the root node of the tree. In fact, any given tree will
contain many instances of the $A$ and $B$ tensors, one for each non-terminal and terminal production,
respectively, all of which are linked. The non-terminal nodes are all unobserved and are best thought of
as random variables that may take the non-terminal symbols as values. Similarly each terminal node
represents a random variable ranging over all terminal symbols. These terminal nodes are all observed.
They are instantiated from the observation sequence by assigning them $\lambda$ vectors of the form
$(0, \ldots, 0, 1, 0, \ldots, 0)$ (called indicator vectors), where the position of the 1 indicates the identity of the
terminal symbol.
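A minimal sketch of these quantities, assuming the grammar is stored as nested Python lists `A[i][j][k]`, `B[i][m]` and `P[i]` (a storage choice made here for illustration), with hypothetical helper names:

```python
def indicator(symbol, size):
    """Indicator vector (0, ..., 0, 1, 0, ..., 0) for an observed terminal."""
    return [1.0 if m == symbol else 0.0 for m in range(size)]

def is_stochastic(A, B, P, tol=1e-9):
    """Check the two stochastic constraints on the grammar parameters.

    For every non-terminal i, the probabilities of all its non-terminal
    and terminal productions must sum to one; the prior must sum to one.
    """
    rows_ok = all(
        abs(sum(sum(row) for row in A[i]) + sum(B[i]) - 1.0) <= tol
        for i in range(len(P))
    )
    return rows_ok and abs(sum(P) - 1.0) <= tol

# A toy grammar with two non-terminals and two terminals.
A = [[[0.1, 0.1], [0.1, 0.1]], [[0.0, 0.2], [0.2, 0.0]]]
B = [[0.3, 0.3], [0.5, 0.1]]
P = [0.5, 0.5]
```

Note that a row of `A` alone does not sum to one: the probability mass of each non-terminal is shared between its binary productions and its terminal productions.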

6.3. Convergence of the training method 

In Section 5 we showed that the iterative training method for the model parameters converges.
However, this theorem was posed under the assumption that the tree structure connecting the nodes is 
given and fixed. So the question arises whether the same result is applicable to the situation described 
above, where this structure is chosen dynamically and may be different from iteration to iteration. 

A moment's thought will reveal that a sufficient condition for its applicability is that trees are
selected on the basis of maximizing the overall likelihood of the training data. To see this, suppose that in
a given iteration the tree topology $T_1$ is chosen as optimal. Then Theorem 1 guarantees that the
likelihood of the data for this topology will not decrease. The likelihood for other topologies may increase or
decrease as the case may be. If in the next iteration another tree topology $T_2$ has an even higher
likelihood than that of $T_1$, it will become the preferred topology. So in any case the likelihood of the data
in the new iteration will not have decreased.

² In Chomsky Normal Form (Chomsky, 1959) the only rules that are allowed are the ones for which $\alpha$ in Eq. (71)
either equals a string of two non-terminal symbols or else is a single terminal symbol.

Fig. 10. The tree building (parsing) algorithm.

6.4. Tree building methods 

As was explained above, a lattice of trees is constructed during the parsing of the utterance, from 
which the best sequence is selected. We will now turn to the problem of constructing the tree lattice. 

6.5. Tree construction: version 1

Since the grammar model is constrained to be in Chomsky Normal Form, the topologies of the trees
are fairly constrained. In particular each terminal symbol has a unique "pre-terminal" node as its parent.
These nodes may be created immediately, as soon as the terminal symbol has been observed (Fig. 10(a)
and (b)). Next the $\lambda$ vector, which has been instantiated for the terminal node, can be propagated up to
the pre-terminal node. The general propagation equations were given earlier (Eqs. (19)
and (27)), but in this simple case these collapse to a simple matrix-vector multiplication:

$$\lambda(U) = B\,\lambda(X), \tag{74}$$

where $U$ is the pre-terminal node and $X$ is the terminal node. The new vector $\lambda(U)$ is independent of
the shape of the eventual parse tree, so it can be calculated at this stage and need not be calculated
again (Fig. 10(b)).

The newly created pre-terminal node can be regarded as a tree of size one, and we can calculate the
entropy of this tree using Eq. (62). Next, for two neighboring pre-terminal nodes, a non-terminal node is
proposed (Fig. 10(c)), having the two pre-terminals as daughters. Again, the $\lambda$ vector may be passed up
to this new node. The propagation equation follows again from Eqs. (19) and (27) and in this case reads

$$\lambda(V)_i = \sum_{jk} A_{ijk}\,\lambda(U_1)_j\,\lambda(U_2)_k, \tag{75}$$

where $V$ is the parent node and $U_1$ and $U_2$ are the two daughters. It is not yet clear whether this
non-terminal will be part of the final tree, but if it is then again its $\lambda$ vector will not change, as it
depends only on its daughters. Again we can calculate the entropy of this tree (of size 2) using Eq. (62).
If the entropy calculated is larger than the sum of the entropies calculated previously for the two
daughter nodes, the new node is rejected and removed from memory. Otherwise it is kept as a potential
node for the final tree. The process continues now in the same fashion by constructing new non-terminal
nodes between any pair of existing non-terminal nodes which span adjacent segments of the observation
sequence (Fig. 10(d)).
Fig. 11. Illustration of version 2 of the training algorithm.



Continuing this procedure leads to a lattice of nodes. While trees are being constructed they can also 
be pruned using the partial backtracking technique mentioned earlier. 
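The bottom-up steps of version 1 can be sketched as below: the diagnostic vector of a pre-terminal is the matrix-vector product of `B` with the terminal's vector, the vector of a proposed parent is the contraction of `A` with the vectors of its two daughters, and the parent is kept only if its entropy does not exceed the sum of the daughters' entropies. The nested-list storage, the use of the initial symbol vector `P` as the root prior in the entropy, and all function names are assumptions of this sketch.

```python
import math

def lam_preterminal(B, lam_x):
    """Vector of a pre-terminal U: sum over m of B[i][m] * lambda(X)[m]."""
    return [sum(b * l for b, l in zip(row, lam_x)) for row in B]

def lam_parent(A, lam_u1, lam_u2):
    """Vector of a parent V: sum over j, k of A[i][j][k] * lambda(U1)[j] * lambda(U2)[k]."""
    return [
        sum(A[i][j][k] * lam_u1[j] * lam_u2[k]
            for j in range(len(lam_u1)) for k in range(len(lam_u2)))
        for i in range(len(A))
    ]

def entropy(lam, prior):
    """Tree entropy -log(lambda(R) . pi(R)), with the prior used as pi(R)."""
    return -math.log(sum(l * p for l, p in zip(lam, prior)))

def keep_parent(A, P, lam_u1, lam_u2):
    """Reject the proposed node if it is costlier than its two daughters."""
    merged = entropy(lam_parent(A, lam_u1, lam_u2), P)
    return merged <= entropy(lam_u1, P) + entropy(lam_u2, P)

A = [[[0.1, 0.1], [0.1, 0.1]], [[0.0, 0.2], [0.2, 0.0]]]
B = [[0.3, 0.3], [0.5, 0.1]]
P = [0.5, 0.5]
lam_u = lam_preterminal(B, [1.0, 0.0])   # terminal symbol 0 was observed
lam_v = lam_parent(A, lam_u, lam_u)      # parent of two such pre-terminals
```

With these untrained toy parameters the merged tree is less likely than the two separate trees, so the proposed parent would be rejected.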

6.6. Problems with version 1 of the tree construction algorithm 

The algorithm above has a serious deficiency that must be rectified before the algorithm can be used 
in practice: it is not capable of learning the succession of two non-terminal symbols unless they occur in 
the same tree. In particular during the first iteration the grammar parameters are still random and each 
tree has size 1. Therefore, with the above training method as it stands the probabilities of the A tensor 
cannot be learned. 

By modifying the tree building algorithm we can however avoid this problem. 

6.7. Tree construction: version 2 

In version 2 of the tree construction algorithm, trees are constructed and isolated in the same way as 
in version 1. However, prior to calculating the partial contributions to the weight re-estimation term (Eq. 
(55)) two neighboring trees are combined into a large tree by hypothesizing a common root node. Fig. 
11(a) illustrates this, during the first iteration, when all constructed trees have size 1. After all weight 
contributions have been calculated, the hypothetical root node is removed and the next two trees are 
combined to a new tree (Fig. 11(b)). (The tree that formed the right branch of the previous construct now
forms the left branch.) Training is repeated on this new construct. In this way, the information about
neighboring trees is learned by virtue of the hypothetical nodes, and hence "tree growth" is possible
during training.

6.7.1. Allowing multiple priors

Version 2 of the tree building algorithm depends upon the initial symbol distribution vector P. An 
additional reduction in entropy can be achieved by allowing this vector to vary according to size of the 
tree to which it is applied. 

In this scenario, rather than storing just one vector P, we use an array of vectors prior[s]. For a tree
of size s (i.e. which spans s symbols of the observation sequence) the vector prior[s] is used as the
prior for that tree.
Fig. 12. Illustration of version 3 of the training algorithm.



6.8. Tree construction: version 3

Version 3 of the tree building algorithm is very similar to version 2. To understand it let us first 
consider an extension to the stochastic context free grammar formalism. 

Up to now the grammar was described by three quantities: $B_{im}$ describing the terminal productions,
$A_{ijk}$ describing the non-terminal productions and $P_i$ describing the prior probability of the root symbol.
We now replace the prior $P_i$ by a matrix of conditional probabilities $D_{ij}$, describing the value of the
current root node given the value of the previous root node. To describe this more precisely, if we have a
sequence of trees $T_1, T_2, \ldots$ spanning the observation sequence, with root nodes $R_1, R_2, \ldots$, then

$$\Pr(R_t = j \mid R_{t-1} = i) = D_{ij}. \tag{76}$$

One way of looking at this is to say that we provide a bigram grammar for the root nodes of the trees.
We now no longer have isolated trees, but rather one large structure. The $\lambda$ and $\pi$ vectors can be
propagated through this structure just as the theory dictates (Section 4.1). Each root node has two
groups of daughter nodes. One group consists of the two immediate daughters within the same tree. Its
relation to the parent is described by the tensor $A$. The other group consists of a single node, the root
node of the next tree, and the relation is described by the matrix $D$.

Fig. 12 illustrates how the $\lambda$ and $\pi$ vectors are propagated through the network.

If this architecture is used for training, the $D$ matrix learns the "root node bigrams". However, this
knowledge has to be transferred into the $A$ tensor, if the trees are to grow in subsequent iterations.
Otherwise we are in a situation similar to the one of version 1 of the tree construction algorithm.

In principle this problem can be solved by linking the parameters of the $D$ matrix with those of the $A$
tensor. But since they have different dimensions this cannot be achieved in quite such a straightforward
way. However, the entries of the $D$ matrix can be calculated from the entries of the $A$ tensor by
re-introducing the initial symbol vector $P$.

In version 2 of the tree building algorithm we hypothesized a new root node $H$ for any pair of
adjacent trees with root nodes $R_1$ and $R_2$. Thus, the tensor $A$ learns the conditional probabilities
$\Pr(R_1, R_2 \mid H)$. Since the prior of $H$ is given by the initial symbol distribution vector $P$, we can get the
joint distribution $\Pr(R_1, R_2)$ by multiplying $A$ by $P$:

$$\Pr(R_1 = j, R_2 = k) = \sum_i P_i A_{ijk}, \tag{77}$$

and from this we can calculate the conditional probabilities by simple normalization:

$$\Pr(R_2 = k \mid R_1 = j) = \frac{\Pr(R_1 = j, R_2 = k)}{\sum_{k'} \Pr(R_1 = j, R_2 = k')}. \tag{78}$$

Hence we can calculate $D$ from $A$ by setting

$$D_{jk} = \frac{\sum_i P_i A_{ijk}}{\sum_{k'} \sum_i P_i A_{ijk'}}. \tag{79}$$



The tree-building and learning algorithm can now be described as follows. At the beginning of the
iteration the $D$ matrix is calculated from the $A$ tensor and the $P$ vector according to Eq. (79) above.
Trees are then constructed using version 1 of the tree building algorithm. The root nodes are then linked
using the $D$ matrices as shown in Fig. 12 and the $\lambda$ and $\pi$ vectors are propagated through the entire
network. When it comes to calculating the contributions to the re-estimation terms (Eq. (47)) the $D$
matrices are again replaced by the $A$ tensors, and the $\lambda$ vectors from the two root nodes plus the $P$
vector are used to calculate the contribution to the $A$ tensor at this point of the structure.
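The first step of each iteration, computing the root bigram matrix from the tensor $A$ and the initial symbol vector $P$, can be sketched as follows. The sketch multiplies $A$ by $P$ to obtain the joint weight of a pair of adjacent roots and then normalizes each row, as the text describes. The nested-list storage, the function name `root_bigram` and the uniform fallback for empty rows are assumptions made here.

```python
def root_bigram(A, P):
    """D[j][k]: probability that the next root is k given the current root j."""
    n = len(P)
    # Joint weight of the adjacent roots (j, k) under a hypothetical parent
    # drawn from P: sum over i of P[i] * A[i][j][k].
    joint = [[sum(P[i] * A[i][j][k] for i in range(n)) for k in range(n)]
             for j in range(n)]
    D = []
    for row in joint:
        total = sum(row)
        # Assumption: use a uniform row if symbol j never occurs as a left daughter.
        D.append([x / total for x in row] if total > 0.0 else [1.0 / n] * n)
    return D

A = [[[0.1, 0.1], [0.1, 0.1]], [[0.0, 0.2], [0.2, 0.0]]]
P = [0.5, 0.5]
D = root_bigram(A, P)
```

Each row of `D` sums to one, so `D` can be used directly as a connection matrix between consecutive root nodes.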

6.9. Discussion on tree building algorithms 

In the previous subsections we have presented three closely related algorithms for parsing the 
observation sequence. Of these, version 1 does not work during the training period, because the trees are
not encouraged to grow. (It is however a useful algorithm once the grammar parameters have been 
learned.) The main reason for including it here was its simplicity and the fact that it forms the basis for 
the other two algorithms. 

Version 2 is a simple extension of version 1, that does allow trees to grow from iteration to iteration. 
The drawback of this algorithm is that training is no longer performed on the structure exhibiting 
minimal entropy, but rather on a modification of this structure. The fact that the minimal entropy 
structure has been modified implies that the convergence argument given in Section 6.3 no longer
strictly applies. So the overall entropy calculated over the observation data may no longer be strictly
decreasing from iteration to iteration. However, experiments have confirmed that this is not a problem in 
practice. 

Version 3 is a neat extension of version 2 which extends the grammar model by a bigram grammar 
between root nodes, so that the trees are no longer independent of each other and in the process 
provides a justification for the hypothetical root nodes. 

For brevity, we refer to the three algorithms collectively as BLI (Bayesian Language Inference) 
algorithms. 

One may ask why sentence-end information is ignored in the algorithms. We believe that Bayesian 
networks are very flexible tools and many information sources can be incorporated if and when they 
become available. For example, if partial bracketing information is available, the tree structures could be
forced to be consistent with it. If the identity of certain non-terminal symbols is known, the
corresponding nodes could be clamped to these values. Long distance semantic or pragmatic information 
could be made available in the form of priors. 

Here, we present the algorithm in its purest form. The fact that it does not require sentence 
boundaries distinguishes it from other approaches and is regarded as an asset. Many text databases are not
provided with explicit sentence-end markers. (Periods occur at the end of a sentence but also elsewhere.) 
If the algorithm is to be used in an application, all available knowledge sources could, and indeed should, 
be realized. 

6.10. Interface with a speech recognizer 

In this paper we only apply the model to symbolic sequences. However, we will briefly describe how 
the algorithm can be naturally integrated with an HMM type speech recognition front end processor. 

The Viterbi decoding algorithm (or the forward calculation) produces likelihoods of the form
$\Pr(O \mid M)$, the probability of the acoustic observation $O$ given the (Markov) model $M$. This probability
corresponds directly to the diagnostic probabilities $\lambda(s_t) = \Pr(e^- \mid s_t)$. Here the evidence $e^-$ is identical
to the acoustic observation $O$ and the terminal symbol $s_t$ corresponds to the model $M$. Thus, when
probabilistic information such as the output of an HMM calculation is available, we can replace the indicator
vector by a vector containing the HMM likelihoods as the $\lambda$ vector of the leaf nodes.

Table 1
Comparison to the Inside-Outside algorithm

| | Inside-Outside | BLI model |
|---|---|---|
| Inside probabilities | $e(s,t,i) = \sum_{jk} \sum_{r=s}^{t-1} A_{ijk}\, e(s,r,j)\, e(r+1,t,k)$ | $\lambda(V)_i = \sum_{jk} A_{ijk}\, \lambda(U_1)_j\, \lambda(U_2)_k$ |
| Outside probabilities | $f(s,t,i) = \sum_{jk} \Bigl( \sum_{r=1}^{s-1} A_{jki}\, f(r,t,j)\, e(r,s-1,k) + \sum_{r=t+1}^{T} A_{jik}\, f(s,r,j)\, e(t+1,r,k) \Bigr)$ | $\pi(U_1)_j = \alpha \sum_{ik} A_{ijk}\, \pi(V)_i\, \lambda(U_2)_k$ |
| Sentence boundaries | required | not required |
| Tree size | is sentence length | not fixed (grows during training) |
| Complexity | cubic in the sentence length; cubic in number of non-terminals | linear; cubic in number of non-terminals ³ |
| Initial symbol | always "S"; prohibits tree growth | any non-terminal |

6.11. Relation to the Inside-Outside Algorithm

The Inside-Outside Algorithm (Baker, 1979; Lari and Young, 1991) is a similar algorithm that is 
capable of inferring stochastic context-free grammars in Chomsky Normal Form from example text. 
Table 1 shows the main differences between this algorithm and the method reported here. 

We consider the Bayesian framework described in the first part of the paper more flexible than the 
Inside-Outside algorithm. Not only does it lead to simpler propagation equations by just considering one
parse tree rather than all, it is also considerably more general, with language modeling as just one
application. In particular, it allows a number of variations of the basic language model such as the 
inclusion of a bigram grammar between the root nodes of the trees as was described in version 3 of the 
tree building algorithm. Of course the BLI algorithm could also easily be changed to honor sentence 
boundaries in the same manner as the Inside-Outside algorithm does. 

However, we are here not concerned with the linguistic analysis of a sentence but with the provision of 
a stochastic language model for speech recognition. For this purpose the limited tree size may be entirely
adequate, while training larger trees would require considerably more training material, consume more
processing time per terminal symbol and run the risk of producing sub-optimal solutions due to the
added degree of complexity.

6.12. The complexity of the algorithm 

The complexity of the algorithm is cubic in the number of non-terminal symbols ³ and linear in the
length of the observation sequence. In comparison, the complexity of the Inside-Outside Algorithm is
cubic in both the number of non-terminals and the sentence length. We found experimentally the
following average times on a DEC-station 5000/240 for processing one symbol of the observation
sequence:

| Number of non-terminals | Training | Testing |
|---|---|---|
| 30 | 0.63 sec | 0.57 sec |
| 50 | 3.09 sec | 2.74 sec |

³ An improved version which is only quadratic in complexity in the number of non-terminals has recently been developed (Lucke,
1994).


These training times were measured during the end of the training. At the beginning, processing is 
considerably faster as the trees are much smaller. 

Since the training procedure is iterative, the required number of parameter updates until convergence
is also a crucial factor in determining the algorithm's complexity. We know of no theoretical result that
asymptotically describes the rate of convergence. Experimentally we found that on noisy data ⁴ about 100
parameter updates were required.

7. Experimental work 

7.1. Database 

The experiments were based on the ATR Dialogue database. This is a text database consisting of
some 8500 Japanese sentences of telephone conversations together with their part of speech labels 
(Ehara et al., 1990). The database contains a total of about 6500 different words with 51 different parts 
of speech. The model is capable of learning stochastic re-write rules from an unlabeled sequence of 
symbols. However, the number of terminal and non-terminal symbols needs to be chosen in advance. To 
ensure accurate estimation of the production probabilities it is required that each non-terminal symbol
occurs reasonably frequently. For this reason we used the part of speech symbol sequence rather than the
word sequence as the input to the model. In this way enough training material for each input symbol was 
ensured. 

For the purpose of the experiment the data base was divided into two equal portions, one for training 
and one for testing. 

7.2. When to perform parameter updates 

In HMM training, it is customary to update the HMM probabilities each time the entire training data 
has been processed. In the experiments reported here we update more frequently. Thus, successive
training epochs are carried out on different parts of the training data. This is motivated by the following
observation: the grammar inference mechanism described here only requires an unlabeled source of 
symbols to operate. Even though training is performed on a finite amount of training data for the 
experiments described here, it is feasible that the algorithm could be used on an infinite symbol source. 
At present when the end of the training data is reached the algorithm starts again at the beginning of the 
training data. With an infinite symbol source one could indefinitely perform incremental improvements 
to the weights by continuously processing the output of the source. 



4 The term "noisy data" here means that different training epochs are performed on different parts of the training data. 
Initially, when the model parameters are still random, only trees of size 1 are constructed (see 
experimental section and Fig. 14). During this phase essentially the unigram statistics of the data base 
are learned. It follows that it is not necessary to have a very long update period during this time of 
learning. As the size of the trees increases, the model parameters can only be accurately estimated by 
taking more and more data into account during one training epoch. Thus it seems advantageous to start 
with a relatively small update period and increase this gradually. This was done in the experiments, 
starting with an update period of 100 symbols, which was increased by 30 symbols after each update. 
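
The growing update period can be sketched as follows. This is only an illustration under stated 
assumptions, not the authors' implementation: the function names and the training-set size `n_train` 
are hypothetical, while the initial period (100), the increment (30) and the total of 300,000 processed 
symbols are the values reported in the experiments. 

```python
def update_periods(initial=100, increase=30):
    """Yield the number of symbols processed before each successive update."""
    period = initial
    while True:
        yield period
        period += increase


def epoch_slices(n_train, total_symbols, initial=100, increase=30):
    """Return (start, length) pairs, one per training epoch.

    `start` is an index into the training data, taken modulo `n_train`
    because processing restarts at the beginning when the end is reached.
    Only whole update periods that fit into `total_symbols` are returned.
    """
    slices, position, processed = [], 0, 0
    for period in update_periods(initial, increase):
        if processed + period > total_symbols:
            break
        slices.append((position % n_train, period))
        position += period
        processed += period
    return slices


# n_train is a hypothetical size; it is not taken from the paper.
slices = epoch_slices(n_train=50_000, total_symbols=300_000)
print(len(slices), slices[:3])
```

With these settings early epochs see only a few hundred symbols, which suffices while unigram 
statistics are being learned, and later epochs see several thousand. 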

7.3. Comparison of tree building methods 

In the first experiment the three parsing algorithms were compared. For this purpose we trained 
models with 30 non-terminal symbols. The results are as follows: 



Method       Tree size    Train entropy    Test entropy 
Version 1    1.0          4.48             4.46 
Version 2    2.81         2.81             2.80 
Version 3    2.94         3.88             3.87 



The table shows the average size of the trees constructed by the algorithm and the entropy over the 
test and training data. As was pointed out earlier, version 1 of the parsing algorithm does not lead to tree 
growth and was included as a control. 

Somewhat surprisingly version 2 performed significantly better than version 3, despite the fact that 
version 3 provides a bigram type grammar for the root nodes of the trees. 

7.4. Evaluation of the BLI model 

In order to evaluate the performance of the BLI model, a version with 50 non-terminal symbols was 
trained. The number 50 was chosen for two reasons: firstly it roughly agrees with the number of terminal 
symbols (51) and secondly it was the largest number that could be trained in reasonable time on a 
workstation. Following the results of the previous section we used version 2 of the tree building 
algorithm. Other model parameters were chosen as follows: 



Number of terminals        51 
Number of non-terminals    50 
Initial update period      100 
Update period increase     30 
Maximum tree size          6 




Fig. 13. Development of the entropy during training (x-axis: number of symbols processed, in 1000s). 

Fig. 14. Development of the average tree size during training (x-axis: number of symbols processed, in 1000s). 

Fig. 15. Example of parsed text using the algorithm. The text roughly translates to "And then you take a taxi from Kyoto station all 
the way to the meeting place. It will be about 1000 yen". 



Table 2 

Experimental results 

Model entropy    Bigram    Trigram    BLI 
Training data    2.83      2.39       2.60 
Test data        2.84      2.76       2.70 



Fig. 13 shows the development of the entropy when estimated during training and Fig. 14 shows the 
average tree size during each training epoch. Initially the entropy was close to 6, but dropped sharply to 
the "unigram level" of about 4.5 after the first few iterations. During this period the tree size was 1. 
When trees started to be constructed the entropy dropped further. Both curves are quite noisy. This is 
mainly attributed to the fact that different training epochs were carried out on different parts of the 
training material. 

Training was discontinued after 300,000 symbols had been processed. The trained model was 
subsequently evaluated on the test data and an entropy of about 2.7 bits was observed. This was 
compared to the entropies obtained from a bigram or trigram grammar 5 . The overall results are 
summarized in Table 2. 
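
For reference, the entropy of a floor-smoothed bigram grammar can be computed along the following lines. 
This is a sketch under assumptions: the paper states only that a constant floor obtained from held-out 
data was used, so the particular scheme here (floor the relative frequencies, then renormalise over the 
vocabulary), the floor value and all function names are illustrative choices rather than the authors' 
procedure. 

```python
import math
from collections import Counter


def train_bigram(symbols):
    """Count bigrams and their left contexts in a symbol sequence."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    context_counts = Counter(symbols[:-1])
    return pair_counts, context_counts


def bigram_prob(prev, cur, pair_counts, context_counts, vocab, floor):
    """Floored relative frequency, renormalised over the vocabulary (assumed scheme)."""
    total = context_counts[prev]

    def floored(sym):
        rel = pair_counts[(prev, sym)] / total if total else 0.0
        return max(rel, floor)

    return floored(cur) / sum(floored(s) for s in vocab)


def entropy_bits(symbols, pair_counts, context_counts, vocab, floor=1e-3):
    """Average negative log2 probability per predicted symbol."""
    log_sum = sum(
        math.log2(bigram_prob(p, c, pair_counts, context_counts, vocab, floor))
        for p, c in zip(symbols, symbols[1:])
    )
    return -log_sum / (len(symbols) - 1)


# Toy data; the real input would be the part-of-speech sequences.
train = list("aabbaabba")
test = list("abbaab")
pairs, contexts = train_bigram(train)
print(entropy_bits(test, pairs, contexts, vocab=["a", "b"]))
```

Evaluating the same function on the training and on the test sequence gives the two rows of a table 
such as Table 2. 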

As can be seen, the BLI model outperforms the bigram model, while its test entropy is similar to the 
trigram entropy (2.70 versus 2.76). On the training data, the BLI entropy is considerably higher than the 
trigram entropy. The difference in entropy between the training and test data is also smaller for the BLI 
model (0.10) than for the trigram model (0.37). This suggests that the BLI model generalizes better than 
the trigram model and should scale better to larger tasks once the problems concerning training time 
have been overcome. 

Fig. 15 shows how the BLI algorithm processed part of the training material. It shows the part of 
speech sequence that was used as input to the model together with the parse trees constructed. The 
underlying word sequence and a translation are included for reference. 



5 The n-gram grammars were smoothed using a constant floor level that was obtained from "held-out" data. 
A visual inspection revealed that the parse trees tended to correspond to Japanese grammatical units 
such as the bunsetsu. 6 However, in many cases a non-alignment was also observed. In particular, since 
the sentence end marker (period) is processed just like any other symbol, it is also incorporated in the 
tree structures. Occasionally such a tree combines the end of one sentence with the beginning of the 
next, contrary to common linguistic analysis. 



8. Discussion 

In this paper we have shown that parse trees may be regarded as Bayesian networks. This framework 
enables one to integrate rule probabilities of stochastic grammars and observational uncertainties (such 
as acoustic match likelihoods) in a mathematically sound and computationally efficient way. Moreover, 
we have shown that the causal and diagnostic support vectors that arise as intermediate quantities in the 
Bayesian formalism may be used, in a natural way, to calculate new grammar parameter estimates. The 
parameter re-estimation is guaranteed to increase the likelihood of the observed data, as long as the 
parse trees are selected on a maximum likelihood principle. 
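
The monotone behaviour of such a re-estimation can be observed on a toy example. The sketch below is not 
the BLI model: it assumes a single hidden parent node that emits two observed daughter symbols 
independently through one shared (tied) emission table, and it uses standard EM with full posteriors. 
All names and the synthetic data are hypothetical. 

```python
import math
import random


def joint(z, obs, weights, emit):
    """Pr(obs, Z=z): the hidden parent z emits each observed symbol independently."""
    p = weights[z]
    for x in obs:
        p *= emit[z][x]
    return p


def log_likelihood(data, weights, emit):
    """Log-likelihood of all observations, summing out the hidden parent."""
    return sum(
        math.log(sum(joint(z, obs, weights, emit) for z in range(len(weights))))
        for obs in data
    )


def em_step(data, weights, emit):
    """One re-estimation: accumulate expected counts, then normalise them."""
    k, m = len(weights), len(emit[0])
    w_counts = [0.0] * k
    e_counts = [[0.0] * m for _ in range(k)]
    for obs in data:
        joints = [joint(z, obs, weights, emit) for z in range(k)]
        total = sum(joints)
        for z in range(k):
            post = joints[z] / total  # Pr(Z=z | obs), the belief in z
            w_counts[z] += post
            for x in obs:
                e_counts[z][x] += post
    new_weights = [c / len(data) for c in w_counts]
    new_emit = [[c / sum(row) for c in row] for row in e_counts]
    return new_weights, new_emit


random.seed(0)
# Synthetic data: each item is a pair of symbols drawn from one of two pools.
data = [tuple(random.choice(pool) for _ in range(2))
        for pool in random.choices([[0, 0, 1], [1, 2, 2]], k=200)]
weights, emit = [0.5, 0.5], [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
history = [log_likelihood(data, weights, emit)]
for _ in range(20):
    weights, emit = em_step(data, weights, emit)
    history.append(log_likelihood(data, weights, emit))
print(history[0], history[-1])
```

The list `history` never decreases from one iteration to the next, up to floating-point rounding. 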

The main advantage of this procedure over similar algorithms such as the Inside-Outside algorithm 
lies in its flexibility. In this paper we have exploited this flexibility by building a language model that 
learns grammatical constructs "bottom-up", i.e. it is not forced (or limited) to recognize whole sentences. 
We have also shown how a bigram grammar across the root symbols can be integrated easily and 
naturally and how probabilistic scores of an HMM could naturally be utilized. 

In the future it seems feasible that additional knowledge sources such as semantic or pragmatic 
information could be integrated to further assist the speech recognition and parsing process. Already 
Bayesian networks are used to represent semantic knowledge in probabilistic expert systems. By 
formulating the parsing mechanism in the same language a powerful integration of various knowledge 
sources may result. 

The tree building and selection method is currently not truly satisfactory. Firstly, it only considers the 
best sequence of trees (an invitation to local, i.e. non-global, maxima). Secondly, it was shown that the 
maximum likelihood selection criterion (required in the convergence proof) had to be sacrificed in order 
to allow tree growth. This difficulty is reflected in the AI domain, where at present Bayesian networks 
are handcrafted; no truly automatic process exists. 

Despite the simplicity and weaknesses of the parsing algorithm we were able to demonstrate 
experimental results which were superior to the established bi- and trigram grammars. The better 
generalization ability that was observed is also notable. 

It is however important to consider the limitations of the technique. These are mainly due to the 
computational load incurred. The algorithm's complexity is cubic in the number of non-terminals 7 , 
limiting the number of non-terminal symbols that can be used. In this paper we have considered discrete 
sequences of words. This is applicable for isolated word recognition systems, i.e. systems in which the 
word boundaries are known. Processing continuous speech using the BLI model is in principle possible: 
one would need to construct the tree lattice on a frame-by-frame (rather than word-by-word) basis. Even 
in relatively efficient schemes such as the one proposed by Ney (1991), this leads to a significant increase 
in the computational load. Another possibility would be to employ the method as a post-filter in order to 
re-order a list of candidate word strings found by a speech recognizer. This could be done relatively 
efficiently using a stack decoder (Jelinek, 1969; Paul, 1991), although the A* search procedure would have 



6 Bunsetsu are phrase-like constructs which are the dominant syntactic category in Japanese. 
7 But see Footnote 3. 
to be replaced by a non-admissible search to allow the processing of larger contexts (Paul, 1992). It is 
our opinion that this difficulty in processing continuous speech mainly arises from the fact that the 
Viterbi algorithm can essentially only incorporate regular grammars (such as n-gram grammars). An 
algorithm for decoding HMMs that is more tolerant towards more general types of grammar would be desirable. 
Appendix A. Proof of Theorem 1 



This proof is a generalization of the corresponding one for Hidden Markov Models (Huang et al., 
1990). Both learning procedures are applications of the more general EM algorithm (Dempster et al., 
1977). We will first consider the case in which the training data Train consists of just one sample $e$. Let 
$\mathcal{N}$ denote the total set of nodes in the model and let 

$$ S : \mathcal{N} \to \mathbb{N} \tag{80} $$

be a function that assigns a particular value to each node in the model, and let $\mathcal{S}$ be the set 
of all such functions. Further, for the sample $e$ and a value allocation $S$ we denote 

$$ \Pr(e, S \mid \Theta) = \Pr\Bigl(e, \bigwedge_{U \in \mathcal{N}} U = S(U) \Bigm| \Theta\Bigr). \tag{81} $$

A moment's thought will reveal that 

$$ \Pr(e, S \mid \Theta) = \prod_{(A:\,V,U_1,\dots,U_n) \in \Theta} A_{S(V),S(U_1),\dots,S(U_n)} \; \prod_{\text{leaf nodes } X} \lambda(X)_{S(X)}. \tag{82} $$

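Now for two different sets of parameters $\Theta$ and $\hat{\Theta}$ define the function 

$$ Q(\Theta, \hat{\Theta}) = \frac{1}{\Pr(e \mid \Theta)} \sum_{S} \Pr(e, S \mid \Theta) \log \Pr(e, S \mid \hat{\Theta}), \tag{83} $$

and consider the term $E_{\Theta} - E_{\hat{\Theta}}$. By the concavity of the log function we have 

\begin{align}
E_{\Theta} - E_{\hat{\Theta}} &= \log \frac{\Pr(e \mid \hat{\Theta})}{\Pr(e \mid \Theta)} \tag{84} \\
&= \log \Bigl[ \sum_{S} \frac{\Pr(e, S \mid \Theta)}{\Pr(e \mid \Theta)} \, \frac{\Pr(e, S \mid \hat{\Theta})}{\Pr(e, S \mid \Theta)} \Bigr] \tag{85} \\
&\ge \sum_{S} \frac{\Pr(e, S \mid \Theta)}{\Pr(e \mid \Theta)} \log \frac{\Pr(e, S \mid \hat{\Theta})}{\Pr(e, S \mid \Theta)} \tag{86} \\
&= Q(\Theta, \hat{\Theta}) - Q(\Theta, \Theta). \tag{87}
\end{align}

It follows that a sufficient condition for $E_{\hat{\Theta}} \le E_{\Theta}$ is that 
$Q(\Theta, \hat{\Theta}) \ge Q(\Theta, \Theta)$. This latter condition is clearly satisfied for a 
$\hat{\Theta}$ that maximizes $Q(\Theta, \cdot)$. In fact we have strict inequality: 

$$ E_{\hat{\Theta}} < E_{\Theta}, \tag{88} $$

unless $\Theta$ itself maximizes $Q(\Theta, \cdot)$. We therefore ask for the set of parameters 
$\hat{\Theta}$ that maximizes $Q(\Theta, \cdot)$ for a given $\Theta$. From Eq. (82) we have 

$$ \log \Pr(e, S \mid \hat{\Theta}) = \sum_{(\hat{A}:\,V,U_1,\dots,U_n) \in \hat{\Theta}} \log \hat{A}_{S(V),S(U_1),\dots,S(U_n)} + \sum_{\text{leaf nodes } X} \log \lambda(X)_{S(X)}. \tag{89} $$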

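Substituting this into Eq. (83) and changing the order of summation gives 

$$ Q(\Theta, \hat{\Theta}) = \sum_{\Theta_h \subset \Theta} \; \sum_{i, j_1, \dots, j_n} c^{h}_{i j_1 \dots j_n} \log \hat{A}^{h}_{i j_1 \dots j_n} + \sum_{\text{leaf nodes } X} \sum_{i} D(X)_i \log \lambda(X)_i, \tag{90} $$

where the tensor $\hat{A}^{h}_{i j_1 \dots j_n}$ is the representative tensor of the class $\Theta_h$, 

\begin{align}
c^{h}_{i j_1 \dots j_n} &= \sum_{(A:\,V,U_1,\dots,U_n) \in \Theta_h} \sum_{S} 1\bigl(S(V) = i, S(U_1) = j_1, \dots, S(U_n) = j_n\bigr) \frac{\Pr(e, S \mid \Theta)}{\Pr(e \mid \Theta)} \tag{91} \\
&= \sum_{(A:\,V,U_1,\dots,U_n) \in \Theta_h} \sum_{S} 1\bigl(S(V) = i, S(U_1) = j_1, \dots, S(U_n) = j_n\bigr) \Pr(S \mid e, \Theta) \tag{92} \\
&= \sum_{(A:\,V,U_1,\dots,U_n) \in \Theta_h} \Pr\Bigl(V = i \wedge \bigwedge_{k=1}^{n} U_k = j_k \Bigm| e, \Theta\Bigr) \tag{93}
\end{align}

and 

\begin{align}
D(X)_i &= \sum_{S} 1\bigl(S(X) = i\bigr) \Pr(S \mid e, \Theta) \tag{94} \\
&= \Pr(X = i \mid e, \Theta) \tag{95} \\
&= \mathrm{BEL}_e(X)_i. \tag{96}
\end{align}

The second term in Eq. (90) is independent of $\hat{\Theta}$, so it remains to maximize 

$$ \sum_{\Theta_h \subset \Theta} \; \sum_{i, j_1, \dots, j_n} c^{h}_{i j_1 \dots j_n} \log \hat{A}^{h}_{i j_1 \dots j_n} \tag{97} $$

with respect to the components of the representative tensor $\hat{A}^{h}_{i j_1 \dots j_n}$ of 
$\Theta_h$ subject to the constraints 

$$ \sum_{j_1, \dots, j_n} \hat{A}^{h}_{i j_1 \dots j_n} = 1. \tag{98} $$

It is easy to show (for example by using Lagrangian multipliers) that this maximization is in fact solved 
by 

$$ \hat{A}^{h}_{i j_1 \dots j_n} = \frac{c^{h}_{i j_1 \dots j_n}}{\sum_{j'_1, \dots, j'_n} c^{h}_{i j'_1 \dots j'_n}}. \tag{99} $$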

This completes the proof of the Theorem for the special case of there being just one sample e in the 
training set. In the case of n samples, we simply create one large sample by writing all n samples one 
after another. We use a model which consists simply of n copies of the model for one sample, with the 
corresponding tensors linked across all copies. This large model applied to the large sample is equivalent 
to the original model applied repeatedly to all samples in the training data. Carrying out the same 
analysis as above for the large model leads to Eqs. (67) and (68) as claimed. 
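
The maximization step of the proof, in which normalised expected counts maximize a count-weighted sum 
of log probabilities under a sum-to-one constraint, can be checked numerically. The sketch below treats 
a single parent index with four daughter configurations. The count values and function names are 
hypothetical, and random search over the probability simplex is used only as a sanity check, not as a proof. 

```python
import math
import random


def normalise(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def objective(counts, probs):
    """Count-weighted log-probability sum_j c_j * log a_j for one parent index."""
    return sum(c * math.log(a) for c, a in zip(counts, probs))


random.seed(1)
counts = [3.2, 0.7, 1.1, 5.0]   # hypothetical expected counts for one parent index
best = normalise(counts)        # closed-form solution: counts divided by their sum
best_value = objective(counts, best)

# Compare against random points on the probability simplex.
trials = [normalise([random.random() + 1e-9 for _ in counts]) for _ in range(1000)]
print(all(objective(counts, a) <= best_value + 1e-12 for a in trials))
```

No random candidate exceeds the closed-form solution, as expected from the Lagrangian argument. 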